mirror of https://github.com/torvalds/linux.git
1478 Commits
| Author | SHA1 | Message | Date |
|---|---|---|---|
|
|
ae98ae2a50 |
btrfs: rename functions to allocate and free extent maps
These functions are exported and don't have a 'btrfs_' prefix in their names, which goes against coding style conventions. Rename them to have such prefix, making it clear they are from btrfs and avoiding potential collisions in the future with functions defined elsewhere outside btrfs. Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
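For illustration, the rename likely follows this pattern (a sketch based on the commit subject; the exact prototypes are assumptions):
```c
/* Before: exported from extent_map.c with generic names. */
struct extent_map *alloc_extent_map(void);
void free_extent_map(struct extent_map *em);

/* After: the btrfs_ prefix marks them as btrfs specific and
 * avoids clashing with identically named symbols elsewhere. */
struct extent_map *btrfs_alloc_extent_map(void);
void btrfs_free_extent_map(struct extent_map *em);
```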
|
|
|
2e871330ce |
btrfs: rename extent map functions to get block start, end and check if in tree
These functions are exported and don't have a 'btrfs_' prefix in their names, which goes against coding style conventions. Rename them to have such prefix, making it clear they are from btrfs and avoiding potential collisions in the future with functions defined elsewhere outside btrfs. Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
|
|
|
962162ffa6 |
btrfs: rename exported extent map compression functions
These functions are exported and don't have a 'btrfs_' prefix in their names, which goes against coding style conventions. Rename them to have such prefix, making it clear they are from btrfs and avoiding potential collisions in the future with functions defined elsewhere outside btrfs. Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
|
|
|
b351161f4f |
btrfs: rename free_extent_state() to include a btrfs prefix
This is an exported function so it should have a 'btrfs_' prefix by convention, to make it clear it's btrfs specific and to avoid collisions with functions from elsewhere in the kernel. Rename the function to add 'btrfs_' prefix to it. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
|
|
|
f81c2aea71 |
btrfs: rename the functions to count, test and get bit ranges in io trees
These functions are exported so they should have a 'btrfs_' prefix by convention, to make it clear they are btrfs specific and to avoid collisions with functions from elsewhere in the kernel. So add a 'btrfs_' prefix to their names to make it clear they are from btrfs. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
|
|
|
9d222562b4 |
btrfs: rename the functions to clear bits for an extent range
These functions are exported so they should have a 'btrfs_' prefix by convention, to make it clear they are btrfs specific and to avoid collisions with functions from elsewhere in the kernel. One of them has a double underscore prefix which is also discouraged. So remove double underscore prefix where applicable and add a 'btrfs_' prefix to their name to make it clear they are from btrfs. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
|
|
|
242570e80b |
btrfs: add btrfs prefix to main lock, try lock and unlock extent functions
These functions are exported so they should have a 'btrfs_' prefix by convention, to make it clear they are btrfs specific and to avoid collisions with functions from elsewhere in the kernel. So add a prefix to their name. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
|
|
|
2b14b74b99 |
btrfs: use folio_contains() for EOF detection
Currently we use the following pattern to detect if the folio contains
the end of a file:
	if (folio->index == end_index)
		folio_zero_range();
But that only works if the folio is page sized.
For the following case, it will not work and leave the range beyond EOF
uninitialized:
The page size is 4K, and the fs block size is also 4K.
   16K         20K   22K   24K
    |           |     |     |
                      |
                  EOF at 22K
And we have a large folio sized 8K at file offset 16K.
In that case, the old "folio->index == end_index" will not work, thus
the range [22K, 24K) will not be zeroed out.
Fix the following call sites which use the above pattern:
- add_ra_bio_pages()
- extent_writepage()
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
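A minimal sketch of the change, assuming the zeroing arguments shown here (the actual call sites differ in detail):
```c
/* Old check: only correct when the folio is a single page. */
if (folio->index == end_index)
	folio_zero_range(folio, offset_in_folio(folio, isize),
			 folio_size(folio) - offset_in_folio(folio, isize));

/* Large-folio safe: true whenever the folio covers the EOF page index. */
if (folio_contains(folio, end_index))
	folio_zero_range(folio, offset_in_folio(folio, isize),
			 folio_size(folio) - offset_in_folio(folio, isize));
```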
|
|
|
e1fcad644b |
btrfs: remove unnecessary early exits in delalloc folio lock and unlock
Inside the functions unlock_delalloc_folio() and lock_delalloc_folios(), we have the following early exits:
if (index == locked_folio->index && end_index == index)
	return;
This allows us to exit early if the range is inside the same locked folio. However the current check relies on page sized folios: if we have a large folio that contains @index but does not start at @index, then the early exit will no longer trigger. Furthermore, without the above early check, the existing code can handle it well, as both __process_folios_contig() and lock_delalloc_folios() will skip any folio lock/unlock if it's on the locked folio. Here we remove the early exits and let the existing code handle the same-index case, to make the code a little simpler. Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
|
|
|
c08d45de63 |
btrfs: prepare end_bbio_data_write() for large data folios
The function is doing an ASSERT() checking the folio order, but all later functions are handling large folios properly, thus we can safely remove that ASSERT(). Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
|
|
|
c757c024fc |
btrfs: use clear_extent_bit() at try_release_extent_state()
Instead of using __clear_extent_bit() we can use clear_extent_bit() since we pass a NULL value for the changeset argument. Reviewed-by: Boris Burkov <boris@bur.io> Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
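A sketch of the simplification, assuming clear_extent_bit() is the thin wrapper that passes a NULL changeset (which is what the message implies):
```c
/* Before: open-coded NULL changeset argument. */
__clear_extent_bit(&inode->io_tree, start, end, EXTENT_LOCKED,
		   &cached_state, NULL);

/* After: the wrapper hides the always-NULL changeset argument. */
clear_extent_bit(&inode->io_tree, start, end, EXTENT_LOCKED, &cached_state);
```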
|
|
|
c4669e4a8b |
btrfs: pass a pointer to get_range_bits() to cache first search result
Allow get_range_bits() to take an extent state pointer to pointer argument so that we can cache the first extent state record in the target range, so that a caller can use it for subsequent operations without doing a full tree search. Currently the only user is try_release_extent_state(), which then does a call to __clear_extent_bit() which can use such a cached state record. Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
|
|
|
32c523c578 |
btrfs: allow folios to be released while ordered extent is finishing
When the release_folio callback (from struct address_space_operations) is
invoked we don't allow the folio to be released if its range is currently
locked in the inode's io_tree, as it may indicate the folio may be needed
by the task that locked the range.
However if the range is locked because an ordered extent is finishing,
then we can safely allow the folio to be released because ordered extent
completion doesn't need to use the folio at all.
When we are under memory pressure, the kernel starts writeback of dirty
pages (folios) with the goal of releasing the pages from the page cache
after writeback completes, however this often is not possible on btrfs
because:
* Once the writeback completes we queue the ordered extent completion;
* Once the ordered extent completion starts, we lock the range in the
inode's io_tree (at btrfs_finish_one_ordered());
* If the release_folio callback is called while the folio's range is
locked in the inode's io_tree, we don't allow the folio to be
released, so the kernel has to try to release memory elsewhere,
which may result in triggering more writeback or releasing other
pages from the page cache which may be more useful to have around
for applications.
In contrast, when the release_folio callback is invoked after writeback
finishes and before ordered extent completion starts or locks the range,
we allow the folio to be released, as well as when the release_folio
callback is invoked after ordered extent completion unlocks the range.
Improve on this by detecting if the range is locked for ordered extent
completion and if it is, allow the folio to be released. This detection
is achieved by adding a new extent flag in the io_tree that is set when
the range is locked during ordered extent completion.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
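Conceptually, the release_folio path can now distinguish the two reasons a range may be locked; a hypothetical sketch (the flag name and surrounding variables are illustrative, not the exact patch):
```c
/* Hypothetical: EXTENT_FINISHING_ORDERED stands in for the new flag. */
if (range_bits & EXTENT_LOCKED) {
	if (range_bits & EXTENT_FINISHING_ORDERED)
		ret = true;   /* locked only for OE completion: safe to release */
	else
		ret = false;  /* some task may still need this folio */
}
```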
|
|
|
cbfb4cbf45 |
btrfs: update comment for try_release_extent_state()
Drop reference to pages from the comment since the function is fully folio aware and works regardless of how many pages are in the folio. Also while at it, capitalize the first word and make it more explicit that release_folio is a callback from struct address_space_operations. Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
|
|
|
062f3d02a2 |
btrfs: remove unused flag EXTENT_BUFFER_IN_TREE
This flag is set after inserting the eb to the buffer tree and cleared on its removal. It was added in commit |
|
|
|
40f47f6d72 |
btrfs: remove unused flag EXTENT_BUFFER_READ_ERR
This flag was added by commit |
|
|
|
38e541051e |
btrfs: open code folio_index() in btree_clear_folio_dirty_tag()
The folio_index() helper is only needed for mixed usage of page cache and swap cache, for pure page cache usage, the caller can just use folio->index instead. It can't be a swap cache folio here. Swap mapping may only call into fs through 'swap_rw' but btrfs does not use that method for swap. Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Kairui Song <kasong@tencent.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
|
|
|
e08e49d986 |
btrfs: adjust subpage bit start based on sectorsize
When running machines with 64k page size and a 16k nodesize we started
seeing tree log corruption in production. This turned out to be because
we were not writing out dirty blocks sometimes, so this in fact affects
all metadata writes.
When writing out a subpage EB we scan the subpage bitmap for a dirty
range. If the range isn't dirty we do
bit_start++;
to move onto the next bit. The problem is the bitmap is based on the
number of sectors that an EB has. So in this case, we have a 64k
pagesize, 16k nodesize, but a 4k sectorsize. This means our bitmap is 4
bits for every node. With a 64k page size we end up with 4 nodes per
page.
To make this easier, here is how everything looks:
[0        16k       32k       48k      ]  logical address
[0        4         8         12       ]  radix tree offset
[               64k page               ]  folio
[ 16k eb ][ 16k eb ][ 16k eb ][ 16k eb ]  extent buffers
[ | | | | | | | | | | | | | | | |     ]  bitmap
Now we use all of our addressing based on fs_info->sectorsize_bits, so
as you can see above, our 16k eb->start turns into radix entry 4.
When we find a dirty range for our eb, we correctly do bit_start +=
sectors_per_node, because if we start at bit 0, the next bit for the
next eb is 4, to correspond to eb->start 16k.
However if our range is clean, we will do bit_start++, which will now
put us out of alignment with our radix tree entries.
In our case, assume that the first time we check the bitmap the block is
not dirty, we increment bit_start so now it == 1, and then we loop
around and check again. This time it is dirty, and we go to find that
start using the following equation
start = folio_start + bit_start * fs_info->sectorsize;
so in the case above, eb->start 0 is now dirty, and we calculate start
as
0 + 1 * fs_info->sectorsize = 4096
4096 >> 12 = 1
Now we're looking up the radix tree for 1, and we won't find an eb.
What's worse is now we're using bit_start == 1, so we do bit_start +=
sectors_per_node, which is now 5. If that eb is dirty we will run into
the same thing, we will look at an offset that is not populated in the
radix tree, and now we're skipping the writeout of dirty extent buffers.
The best fix for this is to not use sectorsize_bits to address nodes,
but that's a larger change. Since this is a fs corruption problem, fix
it simply by always using sectors_per_node to increment the start bit.
Fixes:
|
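The fix, in essence (a sketch of the loop increment described above; @range_dirty stands in for the actual bitmap test):
```c
/* Buggy: stepping a single sector desyncs bit_start from eb boundaries. */
if (!range_dirty) {
	bit_start++;
	continue;
}

/* Fixed: always advance a whole node's worth of sectors, so that
 * bit_start * sectorsize keeps pointing at an eb start offset. */
if (!range_dirty) {
	bit_start += sectors_per_node;
	continue;
}
```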
|
|
|
ebaa602d52 |
btrfs: prepare extent_io.c for future large folio support
When we're handling folios from filemap, we can no longer assume all folios are page sized. Thus for call sites assuming the folio is page sized, change the PAGE_SIZE usage to folio_size() instead. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
|
|
|
cb3c11d2f5 |
btrfs: add a size parameter to btrfs_alloc_subpage()
Since we can no longer assume page sized folio for data filemap folios, allow btrfs_alloc_subpage() to accept a new parameter, @fsize, indicating the folio size. This doesn't follow the regular behavior of passing a folio directly, because this function is shared by both data and metadata folios, and for metadata folios we have extra allocation policy to ensure no large folios whose sizes are larger than nodesize (unless it's page sized). Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
|
|
|
4c14d5c855 |
btrfs: subpage: make btrfs_is_subpage() check against a folio
To support large data folios, we can no longer assume every filemap folio is page sized. So the btrfs_is_subpage() check must be done against a folio. Thankfully for metadata folios, we have full control and ensure a large folio will not be larger than nodesize, so btrfs_meta_is_subpage() doesn't need this change. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
|
|
|
7ef3cbf17d |
btrfs: avoid linker error in btrfs_find_create_tree_block()
The inline function btrfs_is_testing() is hardcoded to return 0 if CONFIG_BTRFS_FS_RUN_SANITY_TESTS is not set. Currently we're relying on the compiler optimizing out the call to alloc_test_extent_buffer() in btrfs_find_create_tree_block(), as it's not been defined (it's behind an #ifdef). Add a stub version of alloc_test_extent_buffer() to avoid linker errors on non-standard optimization levels. This problem was seen on GCC 14 with -O0, which helps to preserve symbols that would otherwise be optimized out. Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Mark Harmstone <maharmstone@fb.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
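The stub plausibly looks like this (a sketch; the exact signature and body are assumptions):
```c
#ifndef CONFIG_BTRFS_FS_RUN_SANITY_TESTS
static struct extent_buffer *alloc_test_extent_buffer(struct btrfs_fs_info *fs_info,
						      u64 start)
{
	/* Unreachable: btrfs_is_testing() is constant false in this config. */
	return NULL;
}
#endif
```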
|
|
|
7ca3e84980 |
btrfs: reject out-of-band dirty folios during writeback
[OUT-OF-BAND DIRTY FOLIOS]
An out-of-band folio means the folio is marked dirty without notifying the filesystem. This can lead to various problems, not limited to:
- No folio::private to track per-block status
- No proper space reserved for such a dirty folio
[HISTORY IN BTRFS]
This used to be a problem related to get_user_page(), but with the introduction of pin_user_pages*(), we should no longer hit such a case. In btrfs, we have a long history of catching such out-of-band dirty folios by:
- Marking the folio ordered during delayed allocation
- Checking the folio ordered flag during writeback
If the folio has no ordered flag, it means it didn't go through delayed allocation, thus it's definitely an out-of-band one. If we get one, we go through COW fixup, which will re-dirty the folio with proper handling in another workqueue.
[PROBLEMS OF COW-FIXUP]
Such a workaround is a blocker for migrating to iomap (it requires extra flags to track whether a folio is dirtied by the fs or not), and I'd argue it's not data checksum safe, since if a folio can be marked dirty without informing the fs, the content can also change at any time. But with the introduction of pin_user_pages*() during the v5.8 merge window, such an out-of-band dirty folio should be treated as a bug. Ext4 has handled such cases by warning and erroring out even before pin_user_pages*(). Furthermore, there is already proof that such folio ordered flag tracking can be screwed up by incorrect error handling; check the commit messages of the following commits: |
|
|
|
0d31ca6584 |
btrfs: allow buffered write to avoid full page read if it's block aligned
[BUG]
Since the support of block size (sector size) < page size for btrfs, test case generic/563 fails with 4K block size and 64K page size:
--- tests/generic/563.out	2024-04-25 18:13:45.178550333 +0930
+++ /home/adam/xfstests-dev/results//generic/563.out.bad	2024-09-30 09:09:16.155312379 +0930
@@ -3,7 +3,8 @@
 read is in range
 write is in range
 write -> read/write
-read is in range
+read has value of 8388608
+read is NOT in range -33792 .. 33792
 write is in range
...
[CAUSE]
The test case creates an 8MiB file, then does buffered writes of 4K block size into the 8MiB, overwriting the whole file. On 4K page sized systems, since the write range covers the full block and page, btrfs will not bother reading the page, just like XFS and EXT4 do. But on 64K page sized systems, although the 4K sized write is still block aligned, it's not page aligned anymore, thus btrfs will read the full page, which will be accounted by cgroup and fail the test, as the test case expects that such 4K block aligned writes do not trigger any read. The expected behavior is an optimization to reduce folio reads when possible, and unfortunately btrfs does not implement that optimization.
[FIX]
To skip the full page read, we need the following modifications:
- Do not trigger a full page read as long as the buffered write is block aligned. This is pretty simple, by modifying the check inside prepare_uptodate_page().
- Skip already uptodate blocks during the full page read. Otherwise we can end up with the following data corruption:
0        32K        64K
|///////|           |
Where the file range [0, 32K) is dirtied by buffered write, the remaining range [32K, 64K) is not. When reading the full page, since [0, 32K) is only dirtied but not written back, there is no data extent map for it, but a hole covering [0, 64K). If we continue reading the full page range [0, 64K), the dirtied range will be filled with 0 (since there is only a hole covering the whole range). This causes the dirtied range to get lost.
With this optimization, btrfs can pass generic/563 even if the page size is larger than the fs block size.
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com> |
|
|
|
b2e743927f |
btrfs: make btrfs_do_readpage() to do block-by-block read
Currently if btrfs has its block size (the older sector size) smaller than the page size, btrfs_do_readpage() will handle the range extent by extent. This is good for performance as it doesn't need to re-lookup the same extent map again and again. (Although get_extent_map() already does an extra cached em check, thus the optimization is not that obvious.)
This is totally fine and is a valid optimization, but it has an assumption that there is no partially uptodate range in the page. Meanwhile there is an incoming feature requiring btrfs to skip the full page read if a buffered write range covers a full block but not a full page. In that case, we can have a page that is partially uptodate, and the current per-extent lookup cannot handle such a case.
So here we change btrfs_do_readpage() to do a block-by-block read. This simplifies the following things:
- Remove the need for the @iosize variable, because we just use sectorsize as our increment.
- Remove @pg_offset, and calculate it inside the loop when needed. It's just offset_in_folio().
- Use a for() loop instead of a while() loop.
This will slightly reduce the read performance for subpage cases, but for the future where we need to skip already uptodate blocks, it should still be worth it. For block size == page size, this brings no performance change.
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com> |
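The reworked loop is conceptually like this (a sketch; submit_one_block() is a hypothetical stand-in for the per-block read submission):
```c
/* Iterate the folio one fs block at a time instead of per extent map. */
for (u64 cur = start; cur < end; cur += fs_info->sectorsize) {
	/* Future optimization: skip blocks a buffered write made uptodate. */
	if (btrfs_folio_test_uptodate(fs_info, folio, cur, fs_info->sectorsize))
		continue;
	/* Look up the extent map covering @cur and submit a read for this
	 * single block; pg_offset is just offset_in_folio(folio, cur). */
	submit_one_block(inode, folio, cur);
}
```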
|
|
|
d2da21a6e0 |
btrfs: introduce a read path dedicated extent lock helper
Currently we're using btrfs_lock_and_flush_ordered_range() for both
btrfs_read_folio() and btrfs_readahead(), but it has one critical
problem for future subpage optimizations:
- It will call btrfs_start_ordered_extent() to writeback the involved
folios
But remember we're calling btrfs_lock_and_flush_ordered_range() at
read paths, meaning the folio is already locked by read path.
If we really trigger writeback for those already locked folios, this
will lead to a deadlock and writeback cannot get the folio lock.
Such a deadlock is prevented by the fact that btrfs always keeps a
dirty folio also uptodate, by either dirtying all blocks of the folio,
or by reading the whole folio before dirtying.
To prepare for the incoming patch which allows btrfs to skip full folio
read if the buffered write is block aligned, we have to start by solving
the possible deadlock first.
Instead of blindly calling btrfs_start_ordered_extent(), introduce a
new helper, which is smarter in the following ways:
- Only wait and flush the ordered extent if
* The folio doesn't even have private bit set
* Part of the blocks of the ordered extent are not uptodate
This can happen by:
* The folio writeback finished, then got invalidated.
There are a lot of reasons that a folio can get invalidated,
from memory pressure to direct IO (which invalidates all folios
of the range).
But OE not yet finished.
We have to wait for the ordered extent, as the OE may contain
to-be-inserted data checksums.
Without waiting, our read can fail due to the missing checksum.
But either way, the OE should not need any extra flush inside the
locked folio range.
- Skip the ordered extent completely if
* All the blocks are dirty
This happens when OE creation is caused by a folio writeback whose
file offset is before our folio.
E.g. 16K page size and 4K block size
0       8K      16K     24K     32K
|//////////////||///////|       |
The writeback of folio 0 created an OE for range [0, 24K), but since
folio 16K is not fully uptodate, a read is triggered for folio 16K.
The writeback will never happen (we're holding the folio lock for
read), nor will the OE finish.
Thus we must skip the range.
* All the blocks are uptodate
This happens when the writeback finished, but OE not yet finished.
Since the blocks are already uptodate, we can skip the OE range.
The new helper lock_extents_for_read() will do a loop for the target
range by:
1) Lock the full range
2) If there is no ordered extent in the remaining range, exit
3) If there is an ordered extent that we can skip
Skip to the end of the OE, and continue checking
We do not trigger writeback nor wait for the OE.
4) If there is an ordered extent that we cannot skip
Unlock the whole extent range and start the ordered extent.
And also update btrfs_start_ordered_extent() to add two more parameters:
@nowriteback_start and @nowriteback_len, to prevent triggering flush for
a certain range.
This will allow us to handle the following case properly in the future:
16K page size, 4K btrfs block size:
0       4K      8K      12K     16K     20K     24K     28K     32K
|/////////////////////////////||////////////////|       |      |
|<-------------------- OE 2 ------------------->|      |< OE 1 >|
The folio has been written back before, thus we have an OE at
[28K, 32K).
Although the OE 1 finished its IO, the OE is not yet removed from IO
tree.
The folio got invalidated after writeback completed and before the
ordered extent finished.
And [16K, 24K) range is dirty and uptodate, caused by a block aligned
buffered write (and future enhancements allowing btrfs to skip full
folio read for such case).
But writeback for folio 0 has begun, thus it generated OE 2, covering
range [0, 24K).
Since the full folio 16K is not uptodate, if we want to read the folio,
the existing btrfs_lock_and_flush_ordered_range() will deadlock by:
btrfs_read_folio()
| Folio 16K is already locked
|- btrfs_lock_and_flush_ordered_range()
   |- btrfs_start_ordered_extent() for range [16K, 24K)
      |- filemap_fdatawrite_range() for range [16K, 24K)
         |- extent_write_cache_pages()
            folio_lock() on folio 16K, deadlock.
But now we will have the following sequence:
btrfs_read_folio()
| Folio 16K is already locked
|- lock_extents_for_read()
   |- can_skip_ordered_extent() for range [16K, 24K)
   |  Returned true, the range [16K, 24K) will be skipped.
   |- can_skip_ordered_extent() for range [28K, 32K)
   |  Returned false.
   |- btrfs_start_ordered_extent() for range [28K, 32K) with
      [16K, 32K) as the no-writeback range
No writeback for folio 16K will be triggered.
And there will be no more possible deadlock on the same folio.
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
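A minimal sketch of the described loop, using the names from the message (exact signatures and the nowriteback parameters are assumptions):
```c
static void lock_extents_for_read(struct btrfs_inode *inode, u64 start, u64 end,
				  struct extent_state **cached)
{
	u64 cur = start;

again:
	/* 1) Lock the full range. */
	lock_extent(&inode->io_tree, start, end, cached);
	while (cur <= end) {
		struct btrfs_ordered_extent *ordered;

		/* 2) No ordered extent left in the remaining range: done. */
		ordered = btrfs_lookup_first_ordered_range(inode, cur,
							   end - cur + 1);
		if (!ordered)
			return;
		/* 3) All blocks dirty or uptodate: skip past this OE. */
		if (can_skip_ordered_extent(inode, ordered)) {
			cur = ordered->file_offset + ordered->num_bytes;
			btrfs_put_ordered_extent(ordered);
			continue;
		}
		/* 4) Must wait: unlock the whole range first, then flush and
		 * wait for the OE while excluding the already locked folio
		 * range from writeback (the new nowriteback parameters). */
		unlock_extent(&inode->io_tree, start, end, cached);
		btrfs_start_ordered_extent(ordered /* , nowriteback range */);
		btrfs_put_ordered_extent(ordered);
		goto again;
	}
}
```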
|
|
|
bd06bce1b3 |
btrfs: use num_extent_folios() in for loop bounds
As the helper num_extent_folios() is now __pure, we can use it in a for loop without storing its value in a variable explicitly; the compiler will do this for us. The effect on btrfs.ko is -200 bytes and there are stack space savings too:
btrfs_clone_extent_buffer       -8  (32 -> 24)
btrfs_clear_buffer_dirty        -8  (48 -> 40)
clear_extent_buffer_uptodate    -8  (40 -> 32)
set_extent_buffer_dirty         -8  (32 -> 24)
write_one_eb                    -8  (88 -> 80)
set_extent_buffer_uptodate      -8  (40 -> 32)
read_extent_buffer_pages_nowait -16 (64 -> 48)
find_extent_buffer              -8  (32 -> 24)
Signed-off-by: David Sterba <dsterba@suse.com> |
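For example (the loop body is illustrative):
```c
/* Because num_extent_folios() is __pure, the compiler may evaluate it
 * once and hoist it out of the loop; no local variable is needed. */
for (int i = 0; i < num_extent_folios(eb); i++) {
	struct folio *folio = eb->folios[i];

	folio_lock(folio);
	/* ... per-folio work ... */
	folio_unlock(folio);
}
```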
|
|
|
b7226ce6c4 |
btrfs: simplify parameters of metadata folio helpers
Unlike the folio helpers for data, the ones for metadata always take the extent buffer start and length, so they can be simplified to take the eb only. The fs_info can be obtained from the eb too, so it can be dropped as a parameter. Added in patch "btrfs: use metadata specific helpers to simplify extent buffer helpers". Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: David Sterba <dsterba@suse.com> |
|
|
|
db3a1bac3f |
btrfs: merge alloc_dummy_extent_buffer() helpers
After previous patch removing nodesize from parameters, __alloc_dummy_extent_buffer() and alloc_dummy_extent_buffer() are identical so we can drop one. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: David Sterba <dsterba@suse.com> |
|
|
|
551561c346 |
btrfs: don't pass nodesize to __alloc_extent_buffer()
All callers pass a valid fs_info so we can read the nodesize from that instead of passing it as parameter. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: David Sterba <dsterba@suse.com> |
|
|
|
fcc384be06 |
btrfs: require strict data/metadata split for subpage checks
Since we have btrfs_meta_is_subpage(), we should make btrfs_is_subpage() data-inode specific. This change involves:
- Simplify btrfs_is_subpage(). Now we only need to do a very simple sectorsize check against PAGE_SIZE. And since the function is pretty simple now, just make it an inline function.
- Add an extra ASSERT() to make sure btrfs_is_subpage() is only called on a data inode mapping.
- Migrate btree_csum_one_bio() to use btrfs_meta_folio_*() helpers.
- Migrate alloc_extent_buffer() to use btrfs_meta_folio_*() helpers.
- Migrate end_bbio_meta_write() to use btrfs_meta_folio_*() helpers.
Otherwise we will trigger the ASSERT() due to calling btrfs_folio_*() on metadata folios.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com> |
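The simplified check plausibly reduces to something like this (the ASSERT details are assumptions):
```c
static inline bool btrfs_is_subpage(const struct btrfs_fs_info *fs_info,
				    struct folio *folio)
{
	/* Only valid for data inode mappings; metadata folios must use
	 * btrfs_meta_is_subpage() instead. */
	ASSERT(is_data_inode(BTRFS_I(folio->mapping->host)));
	return fs_info->sectorsize < PAGE_SIZE;
}
```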
|
|
|
67ebd7a1f1 |
btrfs: simplify subpage handling of read_extent_buffer_pages_nowait()
Use a shared bio_add_folio_nofail() with a calculated range_start/range_len, so that no explicit subpage routine is needed anymore. Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
|
|
|
6c6201278e |
btrfs: simplify subpage handling of write_one_eb()
Currently inside write_one_eb() we have two different ways of handling subpage and regular metadata. The differences are:
- Extra offset/length calculation when adding the folio range to the bio for subpage cases
- Only decrease wbc->nr_to_write if the whole page is no longer dirty for subpage cases
- Use subpage helpers for subpage cases
Merge the two ways into a shared one:
- Always calculate the to-be-queued range, so that bio_add_folio() can use the same calculated length and offset for both cases.
- Use the btrfs_meta_folio_clear_dirty() and btrfs_meta_folio_set_writeback() helpers. This will cover both cases.
- Only decrease wbc->nr_to_write if the folio is no longer dirty. Since we have the folio locked, no one else can modify the folio dirty flags (set_extent_buffer_dirty() will also lock the folio for subpage cases). Thus after our btrfs_meta_folio_clear_dirty() call, if the whole folio is no longer dirty, we're submitting the last dirty eb of the folio, and can decrease wbc->nr_to_write properly.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com> |
|
|
|
7895817b31 |
btrfs: simplify subpage handling of btrfs_clear_buffer_dirty()
The function btrfs_clear_buffer_dirty() is called on dirty extent buffers that will not be written back. The function will call btree_clear_folio_dirty() to clear the folio dirty flag and also clear the PAGECACHE_TAG_DIRTY tag. And we split the subpage and regular handling, as for subpage cases we should only clear PAGECACHE_TAG_DIRTY if the last dirty extent buffer in the page is cleared. So here we can simplify the function by:
- Use the newly introduced btrfs_meta_folio_clear_and_test_dirty() helper. The helper will return true if we cleared the folio dirty flag. With that we can use the same helper for both subpage and regular cases.
- Rename btree_clear_folio_dirty() to btree_clear_folio_dirty_tag(), as we moved the folio dirty clearing into btrfs_clear_buffer_dirty().
- Call btrfs_meta_folio_clear_and_test_dirty() to clear the dirty flags for both regular and subpage metadata cases.
- Only call btree_clear_folio_dirty_tag() when the folio is no longer dirty.
- Update the comment inside set_extent_buffer_dirty(), as there is no separate clear_subpage_extent_buffer_dirty() anymore.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com> |
|
|
|
ee76e5a742 |
btrfs: use metadata specific helpers to simplify extent buffer helpers
The following functions are doing metadata specific checks:
- set_extent_buffer_uptodate()
- clear_extent_buffer_uptodate()
The reason why we do not use btrfs_folio_*() helpers for those helpers is that btrfs_is_subpage() cannot handle a dummy extent buffer if nodesize >= PAGE_SIZE but block size < PAGE_SIZE. In that case, we do not need to attach extra bitmaps to the extent buffer folio. But since dummy extent buffer folios are not attached to the btree inode, btrfs_is_subpage() will return true, causing problems.
And the following are using btrfs_folio_*() helpers for metadata, but in theory we should use metadata specific checks:
- set_extent_buffer_dirty()
This is not causing problems because a dummy extent buffer should never be marked dirty.
To make the code simpler, introduce btrfs_meta_folio_*() helpers to do the metadata specific handling, so that we do not have to open-code such checks in the involved functions above.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com> |
|
|
|
57a3212674 |
btrfs: make subpage attach and detach handle metadata properly
Currently subpage attach/detach is not doing proper dummy extent buffer subpage check, as btrfs_is_subpage() is not reliable for dummy extent buffer folios. Since we have a metadata specific check now, use that for btrfs_attach_subpage() first. Then enhance btrfs_detach_subpage() to accept a type parameter, so that we can do extra checks for dummy extent buffers properly. Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
|
|
|
f64e818153 |
btrfs: factor out metadata subpage detection into a dedicated helper
Currently we have only one btrfs_is_subpage() to cover both data and metadata. But there is a special case for metadata: - dummy extent buffer, sector size < PAGE_SIZE and node size >= PAGE_SIZE In such case, btrfs_is_subpage() will return true for extent buffer folio. But that is not correct, and that's exactly why we have some open-coded checks for functions like set_extent_buffer_uptodate() and clear_extent_buffer_uptodate(). Just extract the metadata specific checks into a helper, and replace those call sites. Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
|
|
|
619611e87f |
btrfs: remove btrfs_fs_info::sectors_per_page
For the future large folio support, our filemap can have folios with
different sizes, thus we can no longer rely on a fixed blocks_per_page
value.
To prepare for that future, here we do:
- Remove btrfs_fs_info::sectors_per_page
- Introduce a helper, btrfs_blocks_per_folio()
Which uses the folio size to calculate the number of blocks for each
folio.
- Migrate the existing btrfs_fs_info::sectors_per_page to use that
helper
There are some exceptions:
* Metadata nodesize < page size support
In the future, even if we support large folios, we will only
allocate a folio that matches our nodesize.
Thus we won't have a folio covering multiple metadata blocks unless
nodesize < page size.
* Existing subpage bitmap dump
We use a single unsigned long to store the bitmap.
That means until we change the bitmap dumping code, our upper limit
for folio size will only be 256K (4K block size, 64 bit unsigned
long).
* btrfs_is_subpage() check
This will be migrated into a future patch.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
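The helper presumably boils down to this (a sketch; the exact definition is an assumption):
```c
static inline unsigned int btrfs_blocks_per_folio(const struct btrfs_fs_info *fs_info,
						  const struct folio *folio)
{
	/* Folio size is always a multiple of the block (sector) size. */
	return folio_size(folio) >> fs_info->sectorsize_bits;
}
```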
|
|
|
87d6aaf79b |
btrfs: avoid assigning twice to block_start at btrfs_do_readpage()
At btrfs_do_readpage() if we get an extent map for a prealloc extent we end up assigning twice to the 'block_start' variable, first the value returned by extent_map_block_start() and then EXTENT_MAP_HOLE. This is pointless so make it more clear by using an if-else statement and doing only one assignment. Also, while at it, move the declaration of 'block_start' into the while loop's scope, since it's not used outside of it and the related 'disk_bytenr' is also declared in this scope. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
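A sketch of the if-else form (field and helper names follow the reworked extent map code; treat them as assumptions):
```c
u64 block_start;

/* One assignment instead of assign-then-overwrite for prealloc. */
if (em->flags & EXTENT_FLAG_PREALLOC)
	block_start = EXTENT_MAP_HOLE;
else
	block_start = extent_map_block_start(em);
```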
|
|
|
c5e8f2924a |
btrfs: remove duplicated metadata folio flag update in end_bbio_meta_read()
In that function we call set_extent_buffer_uptodate() or clear_extent_buffer_uptodate(), which will already update the uptodate flag for all the involved extent buffer folios. Thus there is no need to update the folio uptodate flags again. Just remove the open-coded part. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
|
|
|
b9967834ab |
btrfs: update some folio related comments
Remove references to the page lock and page->mapping. Also btrfs folios can never be swizzled into swap (mentioned in extent_write_cache_pages()). Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
|
|
|
945ce413ac |
for-6.14-rc2-tag
Merge tag 'for-6.14-rc2-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux
Pull btrfs fixes from David Sterba:
- fix stale page cache after race between readahead and direct IO write
- fix hole expansion when writing at an offset beyond EOF, the range will not be zeroed
- use proper way to calculate offsets in folio ranges
* tag 'for-6.14-rc2-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
btrfs: fix hole expansion when writing at an offset beyond EOF
btrfs: fix stale page cache after race between readahead and direct IO write
btrfs: fix two misuses of folio_shift() |
|
|
|
acc18e1c1d |
btrfs: fix stale page cache after race between readahead and direct IO write
After commit |
|
|
|
01af106a07 |
btrfs: fix two misuses of folio_shift()
It is meaningless to shift a byte count by folio_shift(). The folio index is in units of PAGE_SIZE, not folio_size(). We can use folio_contains() to make this work for arbitrary-order folios, so remove the assertion that the folios are of order 0. Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
|
|
|
9c5968db9e |
Merge tag 'mm-stable-2025-01-26-14-59' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
Pull MM updates from Andrew Morton:
"The various patchsets are summarized below. Plus of course many individual patches which are described in their changelogs.
- "Allocate and free frozen pages" from Matthew Wilcox reorganizes the page allocator so we end up with the ability to allocate and free zero-refcount pages. So that callers (ie, slab) can avoid a refcount inc & dec
- "Support large folios for tmpfs" from Baolin Wang teaches tmpfs to use large folios other than PMD-sized ones
- "Fix mm/rodata_test" from Petr Tesarik performs some maintenance and fixes for this small built-in kernel selftest
- "mas_anode_descend() related cleanup" from Wei Yang tidies up part of the mapletree code
- "mm: fix format issues and param types" from Keren Sun implements a few minor code cleanups
- "simplify split calculation" from Wei Yang provides a few fixes and a test for the mapletree code
- "mm/vma: make more mmap logic userland testable" from Lorenzo Stoakes continues the work of moving vma-related code into the (relatively) new mm/vma.c
- "mm/page_alloc: gfp flags cleanups for alloc_contig_*()" from David Hildenbrand cleans up and rationalizes handling of gfp flags in the page allocator
- "readahead: Reintroduce fix for improper RA window sizing" from Jan Kara is a second attempt at fixing a readahead window sizing issue. It should reduce the amount of unnecessary reading
- "synchronously scan and reclaim empty user PTE pages" from Qi Zheng addresses an issue where "huge" amounts of pte pagetables are accumulated: https://lore.kernel.org/lkml/cover.1718267194.git.zhengqi.arch@bytedance.com/ Qi's series addresses this windup by synchronously freeing PTE memory within the context of madvise(MADV_DONTNEED)
- "selftest/mm: Remove warnings found by adding compiler flags" from Muhammad Usama Anjum fixes some build warnings in the selftests code when optional compiler warnings are enabled
- "mm: don't use __GFP_HARDWALL when migrating remote pages" from David Hildenbrand tightens the allocator's observance of __GFP_HARDWALL
- "pkeys kselftests improvements" from Kevin Brodsky implements various fixes and cleanups in the MM selftests code, mainly pertaining to the pkeys tests
- "mm/damon: add sample modules" from SeongJae Park enhances DAMON to estimate application working set size
- "memcg/hugetlb: Rework memcg hugetlb charging" from Joshua Hahn provides some cleanups to memcg's hugetlb charging logic
- "mm/swap_cgroup: remove global swap cgroup lock" from Kairui Song removes the global swap cgroup lock. A speedup of 10% for a tmpfs-based kernel build was demonstrated
- "zram: split page type read/write handling" from Sergey Senozhatsky has several fixes and cleanups for zram in the area of zram_write_page(). A watchdog softlockup warning was eliminated
- "move pagetable_*_dtor() to __tlb_remove_table()" from Kevin Brodsky cleans up the pagetable destructor implementations. A rare use-after-free race is fixed
- "mm/debug: introduce and use VM_WARN_ON_VMG()" from Lorenzo Stoakes simplifies and cleans up the debugging code in the VMA merging logic
- "Account page tables at all levels" from Kevin Brodsky cleans up and regularizes the pagetable ctor/dtor handling. This results in improvements in accounting accuracy
- "mm/damon: replace most damon_callback usages in sysfs with new core functions" from SeongJae Park cleans up and generalizes DAMON's sysfs file interface logic
- "mm/damon: enable page level properties based monitoring" from SeongJae Park increases the amount of information which is presented in response to DAMOS actions
- "mm/damon: remove DAMON debugfs interface" from SeongJae Park removes DAMON's long-deprecated debugfs interfaces. Thus the migration to sysfs is completed
- "mm/hugetlb: Refactor hugetlb allocation resv accounting" from Peter Xu cleans up and generalizes the hugetlb reservation accounting
- "mm: alloc_pages_bulk: small API refactor" from Luiz Capitulino removes a never-used feature of the alloc_pages_bulk() interface
- "mm/damon: extend DAMOS filters for inclusion" from SeongJae Park extends DAMOS filters to support not only exclusion (rejecting), but also inclusion (allowing) behavior
- "Add zpdesc memory descriptor for zswap.zpool" from Alex Shi introduces a new memory descriptor for zswap.zpool that currently overlaps with struct page for now. This is part of the effort to reduce the size of struct page and to enable dynamic allocation of memory descriptors
- "mm, swap: rework of swap allocator locks" from Kairui Song redoes and simplifies the swap allocator locking. A speedup of 400% was demonstrated for one workload. As was a 35% reduction for kernel build time with swap-on-zram
- "mm: update mips to use do_mmap(), make mmap_region() internal" from Lorenzo Stoakes reworks MIPS's use of mmap_region() so that mmap_region() can be made MM-internal
- "mm/mglru: performance optimizations" from Yu Zhao fixes a few MGLRU regressions and otherwise improves MGLRU performance
- "Docs/mm/damon: add tuning guide and misc updates" from SeongJae Park updates DAMON documentation
- "Cleanup for memfd_create()" from Isaac Manjarres does that thing
- "mm: hugetlb+THP folio and migration cleanups" from David Hildenbrand provides various cleanups in the areas of hugetlb folios, THP folios and migration
- "Uncached buffered IO" from Jens Axboe implements the new RWF_DONTCACHE flag which provides synchronous dropbehind for pagecache reading and writing. To permit userspace to address issues with massive buildup of useless pagecache when reading/writing fast devices
- "selftests/mm: virtual_address_range: Reduce memory" from Thomas Weißschuh fixes and optimizes some of the MM selftests"
* tag 'mm-stable-2025-01-26-14-59' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (321 commits)
mm/compaction: fix UBSAN shift-out-of-bounds warning
s390/mm: add missing ctor/dtor on page table upgrade
kasan: sw_tags: use str_on_off() helper in kasan_init_sw_tags()
tools: add VM_WARN_ON_VMG definition
mm/damon/core: use str_high_low() helper in damos_wmark_wait_us()
seqlock: add missing parameter documentation for raw_seqcount_try_begin()
mm/page-writeback: consolidate wb_thresh bumping logic into __wb_calc_thresh
mm/page_alloc: remove the incorrect and misleading comment
zram: remove zcomp_stream_put() from write_incompressible_page()
mm: separate move/undo parts from migrate_pages_batch()
mm/kfence: use str_write_read() helper in get_access_type()
selftests/mm/mkdirty: fix memory leak in test_uffdio_copy()
kasan: hw_tags: Use str_on_off() helper in kasan_init_hw_tags()
selftests/mm: virtual_address_range: avoid reading from VM_IO mappings
selftests/mm: vm_util: split up /proc/self/smaps parsing
selftests/mm: virtual_address_range: unmap chunks after validation
selftests/mm: virtual_address_range: mmap() without PROT_WRITE
selftests/memfd/memfd_test: fix possible NULL pointer dereference
mm: add FGP_DONTCACHE folio creation flag
mm: call filemap_fdatawrite_range_kick() after IOCB_DONTCACHE issue
... |
|
|
|
6bf9b5b40a |
mm: alloc_pages_bulk: rename API
The previous commit removed the page_list argument from alloc_pages_bulk_noprof() along with the alloc_pages_bulk_list() function. Now that only the *_array() flavour of the API remains, we can do the following renaming (along with the _noprof() ones):
alloc_pages_bulk_array           -> alloc_pages_bulk
alloc_pages_bulk_array_mempolicy -> alloc_pages_bulk_mempolicy
alloc_pages_bulk_array_node      -> alloc_pages_bulk_node
Link: https://lkml.kernel.org/r/275a3bbc0be20fbe9002297d60045e67ab3d4ada.1734991165.git.luizcap@redhat.com
Signed-off-by: Luiz Capitulino <luizcap@redhat.com>
Acked-by: David Hildenbrand <david@redhat.com>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Yunsheng Lin <linyunsheng@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
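Callers now look roughly like this (a sketch; note the array should start out NULL-filled, since the bulk allocator skips already populated slots):
```c
struct page *pages[16] = { NULL };
unsigned long filled;

/* Formerly alloc_pages_bulk_array(); fills up to ARRAY_SIZE(pages)
 * slots and returns how many entries are populated. */
filled = alloc_pages_bulk(GFP_KERNEL, ARRAY_SIZE(pages), pages);

for (unsigned long i = 0; i < filled; i++)
	__free_page(pages[i]);
```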
|
|
|
975a6a8855 |
btrfs: add extra error messages for delalloc range related errors
All the error handling bugs I hit so far are -ENOSPC from either:
- cow_file_range()
- run_delalloc_nocow()
- submit_uncompressed_range()
Previously when those functions failed, there was no error message at all, making the debugging much harder. So here we introduce extra error messages for:
- cow_file_range()
- run_delalloc_nocow()
- submit_uncompressed_range()
- writepage_delalloc() when btrfs_run_delalloc_range() failed
- extent_writepage() when extent_writepage_io() failed
One example of the new debug error messages is the following one:
run fstests generic/750 at 2024-12-08 12:41:41
BTRFS: device fsid 461b25f5-e240-4543-8deb-e7c2bd01a6d3 devid 1 transid 8 /dev/mapper/test-scratch1 (253:4) scanned by mount (2436600)
BTRFS info (device dm-4): first mount of filesystem 461b25f5-e240-4543-8deb-e7c2bd01a6d3
BTRFS info (device dm-4): using crc32c (crc32c-arm64) checksum algorithm
BTRFS info (device dm-4): forcing free space tree for sector size 4096 with page size 65536
BTRFS info (device dm-4): using free-space-tree
BTRFS warning (device dm-4): read-write for sector size 4096 with page size 65536 is experimental
BTRFS info (device dm-4): checking UUID tree
BTRFS error (device dm-4): cow_file_range failed, root=363 inode=412 start=503808 len=98304: -28
BTRFS error (device dm-4): run_delalloc_nocow failed, root=363 inode=412 start=503808 len=98304: -28
BTRFS error (device dm-4): failed to run delalloc range, root=363 ino=412 folio=458752 submit_bitmap=11-15 start=503808 len=98304: -28
Which shows an error from cow_file_range() which is called inside a nocow write attempt, along with the extra bitmap from writepage_delalloc().
Reviewed-by: Boris Burkov <boris@bur.io>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com> |
|
|
|
8bf334beb3 |
btrfs: fix double accounting race when extent_writepage_io() failed
[BUG]
If submit_one_sector() failed inside extent_writepage_io() for sector
size < page size cases (e.g. 4K sector size and 64K page size), then
we can hit double ordered extent accounting error.
This should be very rare, as submit_one_sector() only fails when we
failed to grab the extent map, and such extent map should exist inside
the memory and has been pinned.
[CAUSE]
For example we have the following folio layout:
0  4K          32K     48K   60K 64K
|//|           |//////|      |///|
Where |///| is the dirty range we need to write back. The 3 different
dirty ranges are submitted for regular COW.
Now we hit the following sequence:
- submit_one_sector() returned 0 for [0, 4K)
- submit_one_sector() returned 0 for [32K, 48K)
- submit_one_sector() returned error for [60K, 64K)
- btrfs_mark_ordered_io_finished() called for the whole folio
This will mark the following ranges as finished:
* [0, 4K)
* [32K, 48K)
Both ranges have their IO already submitted, this cleanup will
lead to double accounting.
* [60K, 64K)
That's the correct cleanup.
The only good news is, this error is only theoretical, as the target
extent map is always pinned, thus we should directly grab it from
memory, other than reading it from the disk.
[FIX]
Instead of calling btrfs_mark_ordered_io_finished() for the whole folio
range, which can touch ranges we should not touch, move the error
handling inside extent_writepage_io().
So that we can cleanup exact sectors that ought to be submitted but failed.
This provides much more accurate cleanup, avoiding the double accounting.
CC: stable@vger.kernel.org # 5.15+
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
|
|
72dad8e377 |
btrfs: fix double accounting race when btrfs_run_delalloc_range() failed
[BUG] When running btrfs with block size (4K) smaller than page size (64K, aarch64), there is a very high chance to crash the kernel at generic/750, with the following messages: (before the call traces, there are 3 extra debug messages added) BTRFS warning (device dm-3): read-write for sector size 4096 with page size 65536 is experimental BTRFS info (device dm-3): checking UUID tree hrtimer: interrupt took |
|
|
|
ef8c0047aa |
btrfs: remove redundant variables from __process_folios_contig() and lock_delalloc_folios()
Same pattern in both functions: we really only use index; start_index is redundant. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com> |