mirror of https://github.com/torvalds/linux.git
10951 Commits
| Author | SHA1 | Message | Date |
|---|---|---|---|
|
|
321a02db32 |
btrfs: only require sector size alignment for end_bio_extent_writepage()
Just like read page, for subpage support we only require sector size alignment. So change the error message condition to only require sector alignment. This should not affect existing code, as for regular sectorsize == PAGE_SIZE case, we are still requiring page alignment. Tested-by: Ritesh Harjani <riteshh@linux.ibm.com> # [ppc64] Tested-by: Anand Jain <anand.jain@oracle.com> # [aarch64] Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
|
|
|
60e2d25500 |
btrfs: provide btrfs_page_clamp_*() helpers
In the coming subpage RW supports, there are a lot of page status update calls which need to be converted to subpage compatible version, which needs @start and @len. Some call sites already have such @start/@len and are already in page range, like various endio functions. But there are also call sites which need to clamp the range for subpage case, like btrfs_dirty_pagse() and __process_contig_pages(). Here we introduce new helpers, btrfs_page_clamp_*(), to do and only do the clamp for subpage version. Although in theory all existing btrfs_page_*() calls can be converted to use btrfs_page_clamp_*() directly, but that would make us to do unnecessary clamp operations. Tested-by: Ritesh Harjani <riteshh@linux.ibm.com> # [ppc64] Tested-by: Anand Jain <anand.jain@oracle.com> # [aarch64] Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
|
|
|
ed8f13bf4a |
btrfs: refactor page status update into process_one_page()
In __process_pages_contig() we update page status according to page_ops. That update process is a bunch of 'if' branches, which lie inside two loops, this makes it pretty hard to expand for later subpage operations. So this patch will extract these operations into its own function, process_one_pages(). Also since we're refactoring __process_pages_contig(), also move the new helper and __process_pages_contig() before the first caller of them, to remove the forward declaration. Tested-by: Ritesh Harjani <riteshh@linux.ibm.com> # [ppc64] Tested-by: Anand Jain <anand.jain@oracle.com> # [aarch64] Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
|
|
|
98af9ab12b |
btrfs: pass bytenr directly to __process_pages_contig()
As a preparation for incoming subpage support, we need bytenr passed to __process_pages_contig() directly, not the current page index. So change the parameter and all callers to pass bytenr in. With the modification, here we need to replace the old @index_ret with @processed_end for __process_pages_contig(), but this brings a small problem. Normally we follow the inclusive return value, meaning @processed_end should be the last byte we processed. If parameter @start is 0, and we failed to lock any page, then we would return @processed_end as -1, causing more problems for __unlock_for_delalloc(). So here for @processed_end, we use two different return value patterns. If we have locked any page, @processed_end will be the last byte of locked page. Or it will be @start otherwise. This change will impact lock_delalloc_pages(), so it needs to check @processed_end to only unlock the range if we have locked any. Tested-by: Ritesh Harjani <riteshh@linux.ibm.com> # [ppc64] Tested-by: Anand Jain <anand.jain@oracle.com> # [aarch64] Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
|
|
|
968f2566ad |
btrfs: fix hang when run_delalloc_range() failed
[BUG]
When running subpage preparation patches on x86, btrfs/125 will hang
forever with one ordered extent never finished.
[CAUSE]
The test case btrfs/125 itself will always fail as the fix is never merged.
When the test fails at balance, btrfs needs to cleanup the ordered
extent in btrfs_cleanup_ordered_extents() for data reloc inode.
The problem is in the sequence how we cleanup the page Order bit.
Currently it works like:
btrfs_cleanup_ordered_extents()
|- find_get_page();
|- btrfs_page_clear_ordered(page);
| Now the page doesn't have Ordered bit anymore.
| !!! This also includes the first (locked) page !!!
|
|- offset += PAGE_SIZE
| This is to skip the first page
|- __endio_write_update_ordered()
|- btrfs_mark_ordered_io_finished(NULL)
Except the first page, all ordered extents are finished.
Then the locked page is cleaned up in __extent_writepage():
__extent_writepage()
|- If (PageError(page))
|- end_extent_writepage()
|- btrfs_mark_ordered_io_finished(page)
|- if (btrfs_test_page_ordered(page))
|- !!! The page gets skipped !!!
The ordered extent is not decreased as the page doesn't
have ordered bit anymore.
This leaves the ordered extent with bytes_left == sectorsize, thus never
finish.
[FIX]
The fix is to ensure we never clear page Ordered bit without running the
ordered extent accounting.
Here we choose to skip the locked page in
btrfs_cleanup_ordered_extents() so that later end_extent_writepage() can
properly finish the ordered extent.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
|
|
f57ad93735 |
btrfs: rename PagePrivate2 to PageOrdered inside btrfs
Inside btrfs we use Private2 page status to indicate we have an ordered extent with pending IO for the sector. But the page status name, Private2, tells us nothing about the bit itself, so this patch will rename it to Ordered. And with extra comment about the bit added, so reader who is still uncertain about the page Ordered status, will find the comment pretty easily. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
|
|
|
3b8358407a |
btrfs: refactor btrfs_invalidatepage() for subpage support
This patch will refactor btrfs_invalidatepage() for the incoming subpage
support.
The involved modifications are:
- Use while() loop instead of "goto again;"
- Use single variable to determine whether to delete extent states
Each branch will also have comments why we can or cannot delete the
extent states
- Do qgroup free and extent states deletion per-loop
Current code can only work for PAGE_SIZE == sectorsize case.
This refactor also makes it clear what we do for different sectors:
- Sectors without ordered extent
We're completely safe to remove all extent states for the sector(s)
- Sectors with ordered extent, but no Private2 bit
This means the endio has already been executed, we can't remove all
extent states for the sector(s).
- Sectors with ordere extent, still has Private2 bit
This means we need to decrease the ordered extent accounting.
And then it comes to two different variants:
* We have finished and removed the ordered extent
Then it's the same as "sectors without ordered extent"
* We didn't finished the ordered extent
We can remove some extent states, but not all.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
|
|
c095f3333f |
btrfs: introduce btrfs_lookup_first_ordered_range()
Although we already have btrfs_lookup_first_ordered_extent() and btrfs_lookup_ordered_extent(), they all have their own limitations: - btrfs_lookup_ordered_extent() can't do extra range check It's only designed to lookup any ordered extent before certain bytenr. - btrfs_lookup_first_ordered_extent() may not return the first ordered extent in the range It doesn't ensure the first ordered extent is returned. The existing callers are only interested in exhausting all ordered extents in a range, the order is not important. For incoming btrfs_invalidatepage() refactoring, we need a way to properly iterate all ordered extents in their bytenr order of a range. So this patch will introduce a new function, btrfs_lookup_first_ordered_range(), to do ordered extent with bytenr order awareness and extra range check. Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
|
|
|
266a258678 |
btrfs: update comments in btrfs_invalidatepage()
The existing comments in btrfs_invalidatepage() don't really get to the point, especially for what Private2 is really representing and how the race avoidance is done. The truth is, there are only three entrances to do ordered extent accounting: - btrfs_writepage_endio_finish_ordered() - __endio_write_update_ordered() Those two entrance are just endio functions for dio and buffered write. - btrfs_invalidatepage() But there is a pitfall, in endio functions there is no check on whether the ordered extent is already accounted. They just blindly clear the Private2 bit and do the accounting. So it's all btrfs_invalidatepage()'s responsibility to make sure we won't do double account for the same sector. That's why in btrfs_invalidatepage() we have to wait for page writeback, this will ensure all submitted bios have finished, thus their endio functions have finished the accounting on the ordered extent. Then we also check page Private2 to ensure that, we only run ordered extent accounting on pages who has no bio submitted. This patch will rework related comments to make it more clear on the race and how we use wait_on_page_writeback() and Private2 to prevent double accounting on ordered extent. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
|
|
|
e65f152e43 |
btrfs: refactor how we finish ordered extent io for endio functions
Btrfs has two endio functions to mark certain io range finished for ordered extents: - __endio_write_update_ordered() This is for direct IO - btrfs_writepage_endio_finish_ordered() This for buffered IO. However they go different routines to handle ordered extent io: - Whether to iterate through all ordered extents __endio_write_update_ordered() will but btrfs_writepage_endio_finish_ordered() will not. In fact, iterating through all ordered extents will benefit later subpage support, while for current PAGE_SIZE == sectorsize requirement this behavior makes no difference. - Whether to update page Private2 flag __endio_write_update_ordered() will not update page Private2 flag as for iomap direct IO, the page can not be even mapped. While btrfs_writepage_endio_finish_ordered() will clear Private2 to prevent double accounting against btrfs_invalidatepage(). Those differences are pretty subtle, and the ordered extent iterations code in callers makes code much harder to read. So this patch will introduce a new function, btrfs_mark_ordered_io_finished(), to do the heavy lifting: - Iterate through all ordered extents in the range - Do the ordered extent accounting - Queue the work for finished ordered extent This function has two new feature: - Proper underflow detection and recovery The old underflow detection will only detect the problem, then continue. No proper info like root/inode/ordered extent info, nor noisy enough to be caught by fstests. Furthermore when underflow happens, the ordered extent will never finish. New error detection will reset the bytes_left to 0, do proper kernel warning, and output extra info including root, ino, ordered extent range, the underflow value. - Prevent double accounting based on Private2 flag Now if we find a range without Private2 flag, we will skip to next range. As that means someone else has already finished the accounting of ordered extent. This makes no difference for current code, but will be a critical part for incoming subpage support, as we can call btrfs_mark_ordered_io_finished() for multiple sectors if they are beyond inode size. Thus such double accounting prevention is a key feature for subpage. Now both endio functions only need to call that new function. And since the only caller of btrfs_dec_test_first_ordered_pending() is removed, also remove btrfs_dec_test_first_ordered_pending() completely. Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
|
|
|
87b4d86baa |
btrfs: make Private2 lifespan more consistent
Currently we use page Private2 bit to indicate that we have ordered
extent for the page range.
But the lifespan of it is not consistent, during regular writeback path,
there are two locations to clear the same PagePrivate2:
T ----- Page marked Dirty
|
+ ----- Page marked Private2, through btrfs_run_dealloc_range()
|
+ ----- Page cleared Private2, through btrfs_writepage_cow_fixup()
| in __extent_writepage_io()
| ^^^ Private2 cleared for the first time
|
+ ----- Page marked Writeback, through btrfs_set_range_writeback()
| in __extent_writepage_io().
|
+ ----- Page cleared Private2, through
| btrfs_writepage_endio_finish_ordered()
| ^^^ Private2 cleared for the second time.
|
+ ----- Page cleared Writeback, through
btrfs_writepage_endio_finish_ordered()
Currently PagePrivate2 is mostly to prevent ordered extent accounting
being executed for both endio and invalidatepage.
Thus only the one who cleared page Private2 is responsible for ordered
extent accounting.
But the fact is, in btrfs_writepage_endio_finish_ordered(), page
Private2 is cleared and ordered extent accounting is executed
unconditionally.
The race prevention only happens through btrfs_invalidatepage(), where
we wait for the page writeback first, before checking the Private2 bit.
This means, Private2 is also protected by Writeback bit, and there is no
need for btrfs_writepage_cow_fixup() to clear Priavte2.
This patch will change btrfs_writepage_cow_fixup() to just check
PagePrivate2, not to clear it.
The clearing will happen in either btrfs_invalidatepage() or
btrfs_writepage_endio_finish_ordered().
This makes the Private2 bit easier to understand, just meaning the page
has unfinished ordered extent attached to it.
And this patch is a hard requirement for the incoming refactoring for
how we finished ordered IO for endio context, as the coming patch will
check Private2 to determine if we need to do the ordered extent
accounting. Thus this patch is definitely needed or we will hang due to
unfinished ordered extent.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
|
|
38a39ac77e |
btrfs: pass btrfs_inode to btrfs_writepage_endio_finish_ordered()
There is a pretty bad abuse of btrfs_writepage_endio_finish_ordered() in end_compressed_bio_write(). It passes compressed pages to btrfs_writepage_endio_finish_ordered(), which is only supposed to accept inode pages. Thankfully the important info here is the inode, so let's pass btrfs_inode directly into btrfs_writepage_endio_finish_ordered(), and make @page parameter optional. By this, end_compressed_bio_write() can happily pass page=NULL while still getting everything done properly. Also, to cooperate with such modification, replace @page parameter for trace_btrfs_writepage_end_io_hook() with btrfs_inode. Although this removes page_index info, the existing start/len should be enough for most usage. Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
|
|
|
fa04c16574 |
btrfs: make subpage metadata write path call its own endio functions
For subpage metadata, we're reusing two functions for subpage metadata write: - end_bio_extent_buffer_writepage() - write_one_eb() But the truth is, for subpage we just call end_bio_subpage_eb_writepage() without using any bit in end_bio_extent_buffer_writepage(). For write_one_eb(), it's pretty similar, but with a small part of code reused. There is really no need to pollute the existing code path if we're not really using most of them. So this patch will do the following change to separate the subpage metadata write path from regular write path by: - Use end_bio_subpage_eb_writepage() directly as endio in write_one_subpage_eb() - Directly call write_one_subpage_eb() in submit_eb_subpage() Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
|
|
|
390ed29b81 |
btrfs: refactor submit_extent_page() to make bio and its flag tracing easier
There is a lot of code inside extent_io.c needs both "struct bio **bio_ret" and "unsigned long prev_bio_flags", along with some parameters like "unsigned long bio_flags". Such strange parameters are here for bio assembly. For example, we have such inode page layout: 0 4K 8K 12K |<-- Extent A-->|<- EB->| Then what we do is: - Page [0, 4K) *bio_ret = NULL So we allocate a new bio to bio_ret, Add page [0, 4K) to *bio_ret. - Page [4K, 8K) *bio_ret != NULL We found this page is continuous to *bio_ret, and if we're not at stripe boundary, we add page [4K, 8K) to *bio_ret. - Page [8K, 12K) *bio_ret != NULL But we found this page is not continuous, so we submit *bio_ret, then allocate a new bio, and add page [8K, 12K) to the new bio. This means we need to record both the bio and its bio_flag, but we record them manually using those strange parameter list, other than encapsulating them into their own structure. So this patch will introduce a new structure, btrfs_bio_ctrl, to record both the bio, and its bio_flags. Also, in above case, for all pages added to the bio, we need to check if the new page crosses stripe boundary. This check itself can be time consuming, and we don't really need to do that for each page. This patch also integrates the stripe boundary check into btrfs_bio_ctrl. When a new bio is allocated, the stripe and ordered extent boundary is also calculated, so no matter how large the bio will be, we only calculate the boundaries once, to save some CPU time. The following functions/structures are affected: - struct extent_page_data Replace its bio pointer with structure btrfs_bio_ctrl (embedded structure, not pointer) - end_write_bio() - flush_write_bio() Just change how bio is fetched - btrfs_bio_add_page() Use pre-calculated boundaries instead of re-calculating them. And use @bio_ctrl to replace @bio and @prev_bio_flags. - calc_bio_boundaries() New function - submit_extent_page() callers - btrfs_do_readpage() callers - contiguous_readpages() callers To Use @bio_ctrl to replace @bio and @prev_bio_flags, and how to grab bio. - btrfs_bio_fits_in_ordered_extent() Removed, as now the ordered extent size limit is done at bio allocation time, no need to check for each page range. Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
|
|
|
1a0b5c4d64 |
btrfs: allow btrfs_bio_fits_in_stripe() to accept bio without any page
Function btrfs_bio_fits_in_stripe() now requires a bio with at least one page added. Or btrfs_get_chunk_map() will fail with -ENOENT. But in fact this requirement is not needed at all, as we can just pass sectorsize for btrfs_get_chunk_map(). This tiny behavior change is important for later subpage refactoring on submit_extent_page(). As for 64K page size, we can have a page range with pgoff=0 and size=64K. If the logical bytenr is just 16K before the stripe boundary, we have to split the page range into two bios. This means, we must check page range against stripe boundary, even adding the range to an empty bio. This tiny refactoring is for the incoming changes, but on its own, regular sectorsize == PAGE_SIZE is not affected anyway. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
|
|
|
43c0d1a5e1 |
btrfs: remove the unused parameter @len for btrfs_bio_fits_in_stripe()
The parameter @len is not really used in btrfs_bio_fits_in_stripe(),
just remove it.
It got removed in
|
|
|
|
0044ae11e8 |
btrfs: make free space cache size consistent across different PAGE_SIZE
Currently free space cache inode size is determined by two factors: - block group size - PAGE_SIZE This means, for the same sized block groups, with different PAGE_SIZE, it will result in different inode sizes. This will not be a good thing for subpage support, so change the requirement for PAGE_SIZE to sectorsize. Now for the same 4K sectorsize btrfs, it should result the same inode size no matter what the PAGE_SIZE is. Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
|
|
|
8df507cbb5 |
btrfs: scrub: fix subpage repair error caused by hard coded PAGE_SIZE
[BUG] For the following file layout, scrub will not be able to repair all these two repairable error, but in fact make one corruption even unrepairable: inode offset 0 4k 8K Mirror 1 |XXXXXX| | Mirror 2 | |XXXXXX| [CAUSE] The root cause is the hard coded PAGE_SIZE, which makes scrub repair to go crazy for subpage. For above case, when reading the first sector, we use PAGE_SIZE other than sectorsize to read, which makes us to read the full range [0, 64K). In fact, after 8K there may be no data at all, we can just get some garbage. Then when doing the repair, we also writeback a full page from mirror 2, this means, we will also writeback the corrupted data in mirror 2 back to mirror 1, leaving the range [4K, 8K) unrepairable. [FIX] This patch will modify the following PAGE_SIZE use with sectorsize: - scrub_print_warning_inode() Remove the min() and replace PAGE_SIZE with sectorsize. The min() makes no sense, as csum is done for the full sector with padding. This fixes a bug that subpage report extra length like: checksum error at logical 298844160 on dev /dev/mapper/arm_nvme-test, physical 575668224, root 5, inode 257, offset 0, length 12288, links 1 (path: file) Where the error is only 1 sector. - scrub_handle_errored_block() Comments with PAGE|page involved, all changed to sector. - scrub_setup_recheck_block() - scrub_repair_page_from_good_copy() - scrub_add_page_to_wr_bio() - scrub_wr_submit() - scrub_add_page_to_rd_bio() - scrub_block_complete() Replace PAGE_SIZE with sectorsize. This solves several problems where we read/write extra range for subpage case. RAID56 code is excluded intentionally, as RAID56 has extra PAGE_SIZE usage, and is not really safe enough. Thus we will reject RAID56 for subpage in later commit. Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
|
|
|
ec87b42f70 |
btrfs: use list_last_entry in add_falloc_range
Instead of calling list_entry with head->prev simply call list_last_entry which makes it obvious which member of the list is being referred. This allows to remove the extra 'prev' pointer. Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
|
|
|
4183abf6cb |
btrfs: fix comment about max_out in btrfs_compress_pages
Commit
|
|
|
|
65b5355f77 |
btrfs: optimize variables size in btrfs_submit_compressed_write
Patch "btrfs: reduce compressed_bio member's types" reduced some member's size. Function arguments @len, @compressed_len and @nr_pages can be declared as unsigned int. Signed-off-by: Anand Jain <anand.jain@oracle.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
|
|
|
356b4a2dc1 |
btrfs: optimize variables size in btrfs_submit_compressed_read
Patch "btrfs: reduce compressed_bio member's types" reduced some member's size. Declare the variables @compressed_len, @nr_pages and @pg_index size as an unsigned int in the function btrfs_submit_compressed_read. Signed-off-by: Anand Jain <anand.jain@oracle.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
|
|
|
1d08ce5840 |
btrfs: reduce the variable size to fit nr_pages
Patch "btrfs: reduce compressed_bio member's types" reduced the @nr_pages size to unsigned int, its cascading effects are updated here. Signed-off-by: Anand Jain <anand.jain@oracle.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
|
|
|
b590b83972 |
btrfs: avoid unnecessary logging of xattrs during fast fsyncs
When logging an inode we always log all its xattrs, so that we are able
to figure out which ones should be deleted during log replay. However this
is unnecessary when we are doing a fast fsync and no xattrs were added,
changed or deleted since the last time we logged the inode in the current
transaction.
So skip the logging of xattrs when the inode was previously logged in the
current transaction and no xattrs were added, changed or deleted. If any
changes to xattrs happened, than the inode has BTRFS_INODE_COPY_EVERYTHING
set in its runtime flags and the xattrs get logged. This saves time on
scanning for xattrs, allocating memory, COWing log tree extent buffers and
adding more lock contention on the extent buffers when there are multiple
tasks logging in parallel.
The use of xattrs is common when using ACLs, some applications, or when
using security modules like SELinux where every inode gets a security
xattr added to it.
The following test script, using fio, was used on a box with 12 cores, 64G
of RAM, a NVMe device and the default non-debug kernel config from Debian.
It uses 8 concurrent jobs each writing in blocks of 64K to its own 4G file,
each file with a single xattr of 50 bytes (about the same size for an ACL
or SELinux xattr), doing random buffered writes with an fsync after each
write.
$ cat test.sh
#!/bin/bash
DEV=/dev/nvme0n1
MNT=/mnt/test
MOUNT_OPTIONS="-o ssd"
MKFS_OPTIONS="-d single -m single"
NUM_JOBS=8
FILE_SIZE=4G
cat <<EOF > /tmp/fio-job.ini
[writers]
rw=randwrite
fsync=1
fallocate=none
group_reporting=1
direct=0
bs=64K
ioengine=sync
size=$FILE_SIZE
directory=$MNT
numjobs=$NUM_JOBS
EOF
echo "performance" | \
tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor
mkfs.btrfs -f $MKFS_OPTIONS $DEV > /dev/null
mount $MOUNT_OPTIONS $DEV $MNT
echo "Creating files before fio runs, each with 1 xattr of 50 bytes"
for ((i = 0; i < $NUM_JOBS; i++)); do
path="$MNT/writers.$i.0"
truncate -s $FILE_SIZE $path
setfattr -n user.xa1 -v $(printf '%0.sX' $(seq 50)) $path
done
fio /tmp/fio-job.ini
umount $MNT
fio output before this change:
WRITE: bw=120MiB/s (126MB/s), 120MiB/s-120MiB/s (126MB/s-126MB/s), io=32.0GiB (34.4GB), run=272145-272145msec
fio output after this change:
WRITE: bw=142MiB/s (149MB/s), 142MiB/s-142MiB/s (149MB/s-149MB/s), io=32.0GiB (34.4GB), run=230408-230408msec
+16.8% throughput, -16.6% runtime
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
|
|
67ae34b69c |
btrfs: add device delete cancel
Accept device name "cancel" as a request to cancel running device deletion operation. The string is literal, in case there's a real device named "cancel", pass it as full absolute path or as "./cancel" This works for v1 and v2 ioctls when the device is specified by name. Moving chunks from the device uses relocation, use the conditional exclusive operation start and cancellation helpers Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: David Sterba <dsterba@suse.com> |
|
|
|
bb059a37c9 |
btrfs: add cancellation to resize
Accept literal string "cancel" as resize operation and interpret that as a request to cancel the running operation. If it's running, wait until it finishes current work and return ECANCELED. Shrinking resize uses relocation to move the chunks away, use the conditional exclusive operation start and cancellation helpers. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: David Sterba <dsterba@suse.com> |
|
|
|
17aaa434ed |
btrfs: add wrapper for conditional start of exclusive operation
To support optional cancellation of some operations, add helper that will wrap all the combinations. In normal mode it's same as btrfs_exclop_start, in cancellation mode it checks if it's already running and request cancellation and waits until completion. The error codes can be returned to to user space and semantics is not changed, adding ECANCELED. This should be evaluated as an error and that the operation has not completed and the operation should be restarted or the filesystem status reviewed. Signed-off-by: David Sterba <dsterba@suse.com> |
|
|
|
578bda9e17 |
btrfs: introduce try-lock semantics for exclusive op start
Add try-lock for exclusive operation start to allow callers to do more checks. The same operation must already be running. The try-lock and unlock must pair and are a substitute for btrfs_exclop_start, thus it must also pair with btrfs_exclop_finish to release the exclop context. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com> |
|
|
|
907d2710d7 |
btrfs: add cancellable chunk relocation support
Add support code that will allow canceling relocation on the chunk granularity. This is different and independent of balance, that also uses relocation but is a higher level operation and manages it's own state and pause/cancellation requests. Relocation is used for resize (shrink) and device deletion so this will be a common point to implement cancellation for both. The context is entirely in btrfs_relocate_block_group and btrfs_recover_relocation, enclosing one chunk relocation. The status bit is set and unset between the chunks. As relocation can take long, the effects may not be immediate and the request and actual action can slightly race. The fs_info::reloc_cancel_req is only supposed to be increased and does not pair with decrement like fs_info::balance_cancel_req. Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
|
|
|
0d7ed32c1e |
btrfs: protect exclusive_operation by super_lock
The exclusive operation is now atomically checked and set using bit operations. Switch it to protection by spinlock. The super block lock is not frequently used and adding a new lock seems like an overkill so it should be safe to reuse it. The reason to use spinlock is to enhance the locking context so more checks can be done, eg. allowing the same exclusive operation enter the exclop section and cancel the running one. This will be used for resize and device delete. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: David Sterba <dsterba@suse.com> |
|
|
|
24880be59c |
btrfs: clean up header members offsets in write helpers
Move header offsetof() to the expression that calculates the address so it's part of get_eb_offset_in_page where the 2nd parameter is the member offset. Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
|
|
|
dfd29eed4a |
btrfs: simplify eb checksum verification in btrfs_validate_metadata_buffer
The verification copies the calculated checksum bytes to a temporary buffer but this is not necessary. We can map the eb header on the first page and use the checksum bytes directly. This saves at least one function call and boundary checks so it could lead to a minor performance improvement. Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
|
|
|
ff14aa7987 |
btrfs: remove extra sb::s_id from message in btrfs_validate_metadata_buffer
The s_id is already printed by message helpers. Reviewed-by: Qu Wenruo <wqu@suse.com> Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com> |
|
|
|
282ab3ff16 |
btrfs: reduce compressed_bio members' types
Several members of compressed_bio are of type that's unnecessarily big
for the values that they'd hold:
- the size of the uncompressed and compressed data is 128K now, we can
keep is as int
- same for number of pages
- the compress type fits to a byte
- the errors is 0/1
The size of the unpatched structure is 80 bytes with several holes.
Reordering nr_pages next to the pages the hole after pending_bios is
filled and the resulting size is 56 bytes. This keeps the csums array
aligned to 8 bytes, which is nice. Further size optimizations may be
possible but right now it looks good to me:
struct compressed_bio {
refcount_t pending_bios; /* 0 4 */
unsigned int nr_pages; /* 4 4 */
struct page * * compressed_pages; /* 8 8 */
struct inode * inode; /* 16 8 */
u64 start; /* 24 8 */
unsigned int len; /* 32 4 */
unsigned int compressed_len; /* 36 4 */
u8 compress_type; /* 40 1 */
u8 errors; /* 41 1 */
/* XXX 2 bytes hole, try to pack */
int mirror_num; /* 44 4 */
struct bio * orig_bio; /* 48 8 */
u8 sums[]; /* 56 0 */
/* size: 56, cachelines: 1, members: 12 */
/* sum members: 54, holes: 1, sum holes: 2 */
/* last cacheline: 56 bytes */
};
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
|
|
49547068f6 |
btrfs: document byte swap optimization of root_item::flags accessors
Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com> |
|
|
|
7735cd755b |
btrfs: scrub: factor out common scrub_stripe constraints
There are common values set for the stripe constraints, some of them are already factored out. Do that for increment and mirror_num as well. Reviewed-by: Qu Wenruo <wqu@suse.com> Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com> |
|
|
|
1aeb6b563a |
btrfs: clear log tree recovering status if starting transaction fails
When a log recovery is in progress, lots of operations have to take that into account, so we keep this status per tree during the operation. Long time ago error handling revamp patch |
|
|
|
6819703f5a |
btrfs: clear defrag status of a root if starting transaction fails
The defrag loop processes leaves in batches and starting transaction for each. The whole defragmentation on a given root is protected by a bit but in case the transaction fails, the bit is not cleared In case the transaction fails the bit would prevent starting defragmentation again, so make sure it's cleared. CC: stable@vger.kernel.org # 4.4+ Reviewed-by: Qu Wenruo <wqu@suse.com> Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com> |
|
|
|
8c5ec99561 |
btrfs: sysfs: fix format string for some discard stats
The type of discard_bitmap_bytes and discard_extent_bytes is u64 so the format should be %llu, though the actual values would hardly ever overflow to negative values. Reviewed-by: Qu Wenruo <wqu@suse.com> Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com> |
|
|
|
5963ffcaf3 |
btrfs: always abort the transaction if we abort a trans handle
While stress testing our error handling I noticed that sometimes we would still commit the transaction even though we had aborted the transaction. Currently we track if a trans handle has dirtied any metadata, and if it hasn't we mark the filesystem as having an error (so no new transactions can be started), but we will allow the current transaction to complete as we do not mark the transaction itself as having been aborted. This sounds good in theory, but we were not properly tracking IO errors in btrfs_finish_ordered_io, and thus committing the transaction with bogus free space data. This isn't necessarily a problem per-se with the free space cache, as the other guards in place would have kept us from accepting the free space cache as valid, but highlights a real world case where we had a bug and could have corrupted the filesystem because of it. This "skip abort on empty trans handle" is nice in theory, but assumes we have perfect error handling everywhere, which we clearly do not. Also we do not allow further transactions to be started, so all this does is save the last transaction that was happening, which doesn't necessarily gain us anything other than the potential for real corruption. Remove this particular bit of code, if we decide we need to abort the transaction then abort the current one and keep us from doing real harm to the file system, regardless of whether this specific trans handle dirtied anything or not. Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
|
|
|
0d7d316597 |
btrfs: don't set the full sync flag when truncation does not touch extents
At btrfs_truncate() where we truncate the inode either to the same size or to a smaller size, we always set the full sync flag on the inode. This is needed in case the truncation drops or trims any file extent items that start beyond or cross the new inode size, so that the next fsync drops all inode items from the log and scans again the fs/subvolume tree to find all items that must be logged. However if the truncation does not drop or trims any file extent items, we do not need to set the full sync flag and force the next fsync to use the slow code path. So do not set the full sync flag in such cases. One use case where it is frequent to do truncations that do not change the inode size and do not drop any extents (no prealloc extents beyond i_size) is when running Microsoft's SQL Server inside a Docker container. One example workload is the one Philipp Fent reported recently, in the thread with a link below. In this workload a large number of fsyncs are preceded by such truncate operations. After this change I constantly get the runtime for that workload from Philipp to be reduced by about -12%, for example from 184 seconds down to 162 seconds. Link: https://lore.kernel.org/linux-btrfs/93c4600e-5263-5cba-adf0-6f47526e7561@in.tum.de/ Tested-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
|
|
|
4f7e67378e |
btrfs: fix misleading and incomplete comment of btrfs_truncate()
The comment at the top of btrfs_truncate() mentions that csum items are dropped or truncated to the new i_size, but this is wrong and non sense, as they are unrelated to the i_size and are located in the csums tree and not on a tree with inode items (fs/subvolume tree or a log tree). Instead that claim applies to file extent items, so fix the comment to refer to them instead. While at it make the whole comment for the function more descriptive and follow the kernel doc style. Tested-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
|
|
|
04587ad9be |
btrfs: abort transaction if we fail to update the delayed inode
If we fail to update the delayed inode we need to abort the transaction, because we could leave an inode with the improper counts or some other such corruption behind. Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
|
|
|
bb385bedde |
btrfs: fix error handling in __btrfs_update_delayed_inode
If we get an error while looking up the inode item we'll simply bail without cleaning up the delayed node. This results in this style of warning happening on commit: WARNING: CPU: 0 PID: 76403 at fs/btrfs/delayed-inode.c:1365 btrfs_assert_delayed_root_empty+0x5b/0x90 CPU: 0 PID: 76403 Comm: fsstress Tainted: G W 5.13.0-rc1+ #373 Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.13.0-2.fc32 04/01/2014 RIP: 0010:btrfs_assert_delayed_root_empty+0x5b/0x90 RSP: 0018:ffffb8bb815a7e50 EFLAGS: 00010286 RAX: 0000000000000000 RBX: ffff95d6d07e1888 RCX: ffff95d6c0fa3000 RDX: 0000000000000002 RSI: 000000000029e91c RDI: ffff95d6c0fc8060 RBP: ffff95d6c0fc8060 R08: 00008d6d701a2c1d R09: 0000000000000000 R10: ffff95d6d1760ea0 R11: 0000000000000001 R12: ffff95d6c15a4d00 R13: ffff95d6c0fa3000 R14: 0000000000000000 R15: ffffb8bb815a7e90 FS: 00007f490e8dbb80(0000) GS:ffff95d73bc00000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 00007f6e75555cb0 CR3: 00000001101ce001 CR4: 0000000000370ef0 Call Trace: btrfs_commit_transaction+0x43c/0xb00 ? finish_wait+0x80/0x80 ? vfs_fsync_range+0x90/0x90 iterate_supers+0x8c/0x100 ksys_sync+0x50/0x90 __do_sys_sync+0xa/0x10 do_syscall_64+0x3d/0x80 entry_SYSCALL_64_after_hwframe+0x44/0xae Because the iref isn't dropped and this leaves an elevated node->count, so any release just re-queues it onto the delayed inodes list. Fix this by going to the out label to handle the proper cleanup of the delayed node. Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
|
|
|
a4cb90dc01 |
btrfs: make btrfs_release_delayed_iref handle the !iref case
Right now we only cleanup the delayed iref if we have BTRFS_DELAYED_NODE_DEL_IREF set on the node. However we have some error conditions that need to cleanup the iref if it still exists, so to make this code cleaner move the test_bit into btrfs_release_delayed_iref itself and unconditionally call it in each of the cases instead. Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
|
|
|
eb3b505366 |
btrfs: scrub: per-device bandwidth control
Add sysfs interface to limit io during scrub. We relied on the ionice interface to do that, eg. the idle class let the system usable while scrub was running. This has changed when mq-deadline got widespread and did not implement the scheduling classes. That was a CFQ thing that got deleted. We've got numerous complaints from users about degraded performance. Currently only BFQ supports that but it's not a common scheduler and we can't ask everybody to switch to it. Alternatively the cgroup io limiting can be used but that also a non-trivial setup (v2 required, the controller must be enabled on the system). This can still be used if desired. Other ideas that have been explored: piggy-back on ionice (that is set per-process and is accessible) and interpret the class and classdata as bandwidth limits, but this does not have enough flexibility as there are only 8 allowed and we'd have to map fixed limits to each value. Also adjusting the value would need to lookup the process that currently runs scrub on the given device, and the value is not sticky so would have to be adjusted each time scrub runs. Running out of options, sysfs does not look that bad: - it's accessible from scripts, or udev rules - the name is similar to what MD-RAID has (/proc/sys/dev/raid/speed_limit_max or /sys/block/mdX/md/sync_speed_max) - the value is sticky at least for filesystem mount time - adjusting the value has immediate effect - sysfs is available in constrained environments (eg. system rescue) - the limit also applies to device replace Sysfs: - raw value is in bytes - values written to the file accept suffixes like K, M - file is in the per-device directory /sys/fs/btrfs/FSID/devinfo/DEVID/scrub_speed_max - 0 means use default priority of IO The scheduler is a simple deadline one and the accuracy is up to nearest 128K. Signed-off-by: David Sterba <dsterba@suse.com> |
|
|
|
e7ff9e6b8e |
btrfs: zoned: factor out zoned device lookup
To be able to construct a zone append bio we need to look up the btrfs_device. The code doing the chunk map lookup to get the device is present in btrfs_submit_compressed_write and submit_extent_page. Factor out the lookup calls into a helper and use it in the submission paths. Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
|
|
|
50535db8fb |
btrfs: return EAGAIN if defrag is canceled
When inode defrag is canceled, the error is set to EAGAIN but then overwritten by number of defragmented bytes. As this would hide the error, rather return EAGAIN. This does not harm 'btrfs fi defrag', it will print the error and continue to next file (as it does in for any other error). Signed-off-by: Tian Tao <tiantao6@hisilicon.com> Reviewed-by: David Sterba <dsterba@suse.com> [ update changelog ] Signed-off-by: David Sterba <dsterba@suse.com> |
|
|
|
1245835d24 |
btrfs: remove io_failure_record::in_validation
The io_failure_record::in_validation was introduced to handle failed bio which cross several sectors. In such case, we still need to verify which sectors are corrupted. But since we've changed the way how we handle corrupted sectors, by only submitting repair for each corrupted sector, there is no need for extra validation any more. This patch will cleanup all io_failure_record::in_validation related code. Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
|
|
|
150e4b0597 |
btrfs: submit read time repair only for each corrupted sector
Currently btrfs_submit_read_repair() has some extra check on whether the
failed bio needs extra validation for repair. But we can avoid all
these extra mechanisms if we submit the repair for each sector.
By this, each read repair can be easily handled without the need to
verify which sector is corrupted.
This will also benefit subpage, as one subpage bvec can contain several
sectors, making the extra verification more complex.
So this patch will:
- Introduce repair_one_sector()
The main code submitting repair, which is more or less the same as old
btrfs_submit_read_repair().
But this time, it only repairs one sector.
- Make btrfs_submit_read_repair() to handle sectors differently
There are 3 different cases:
* Good sector
We need to release the page and extent, set the range uptodate.
* Bad sector and failed to submit repair bio
We need to release the page and extent, but not set the range
uptodate.
* Bad sector but repair bio submitted
The page and extent release will be handled by the submitted repair
bio. Nothing needs to be done.
Since btrfs_submit_read_repair() will handle the page and extent
release now, we need to skip to next bvec even we hit some error.
- Change the lifespan of @uptodate in end_bio_extent_readpage()
Since now btrfs_submit_read_repair() will handle the full bvec
which contains any corruption, we don't need to bother updating
@uptodate bit anymore.
Just let @uptodate to be local variable inside the main loop,
so that any error from one bvec won't affect later bvec.
- Only export btrfs_repair_one_sector(), unexport
btrfs_submit_read_repair()
The only outside caller for read repair is DIO, which already submits
its repair for just one sector.
Only export btrfs_repair_one_sector() for DIO.
This patch will focus on the change on the repair path, the extra
validation code is still kept as is, and will be cleaned up later.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
|
|
08508fea07 |
btrfs: make btrfs_verify_data_csum() to return a bitmap
This will provide the basis for later per-sector repair for subpage, while still keeping the existing code happy. As if all csums match, the return value will be 0, same as now. Only when csum mismatches, the return value is different. The new return value will be a bitmap, for 4K sectorsize and 4K page size, it will be either 1, instead of the -EIO (which is not used directly by the callers, no effective change). But for 4K sectorsize and 64K page size, aka subpage case, since the bvec can contain multiple sectors, knowing which sectors are corrupted will allow us to submit repair only for corrupted sectors. Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
|
|
|
f4dcfb3045 |
btrfs: rename check_async_write and let it return bool
The 'check_async_write' function is a helper used in 'btrfs_submit_metadata_bio' and it checks if asynchronous writing can be used for metadata. Make the function return bool and get rid of the local variable async in btrfs_submit_metadata_bio storing the result of check_async_write's tests. As this is touching all function call sites, also rename it to should_async_write as this is more in line with the naming we use. Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
|
|
|
06e1e7f422 |
btrfs: zoned: bail out if we can't read a reliable write pointer
If we can't read a reliable write pointer from a sequential zone fail creating the block group with an I/O error. Also if the read write pointer is beyond the end of the respective zone, fail the creation of the block group on this zone with an I/O error. While this could also happen in real world scenarios with misbehaving drives, this issue addresses a problem uncovered by fstests' test case generic/475. CC: stable@vger.kernel.org # 5.12+ Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
|
|
|
47cdfb5e1d |
btrfs: zoned: print message when zone sanity check type fails
This extends patch |
|
|
|
385f421f18 |
btrfs: handle preemptive delalloc flushing slightly differently
If we decide to flush delalloc from the preemptive flusher, we really do not want to wait on ordered extents, as it gains us nothing. However there was logic to go ahead and wait on ordered extents if there was more ordered bytes than delalloc bytes. We do not want this behavior, so pass through whether this flushing is for preemption, and do not wait for ordered extents if that's the case. Also break out of the shrink loop after the first flushing, as we just want to one shot shrink delalloc. Signed-off-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: David Sterba <dsterba@suse.com> |
|
|
|
3e10156997 |
btrfs: only ignore delalloc if delalloc is much smaller than ordered
While testing heavy delalloc workloads I noticed that sometimes we'd just stop preemptively flushing when we had loads of delalloc available to flush. This is because we skip preemptive flushing if delalloc <= ordered. However if we start with say 4gib of delalloc, and we flush 2gib of that, we'll stop flushing there, when we still have 2gib of delalloc to flush. Instead adjust the ordered bytes down by half, this way if 2/3 of our outstanding delalloc reservations are tied up by ordered extents we don't bother preemptive flushing, as we're getting close to the state where we need to wait on ordered extents. Signed-off-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: David Sterba <dsterba@suse.com> |
|
|
|
30acce4eb0 |
btrfs: don't include the global rsv size in the preemptive used amount
When deciding if we should preemptively flush space, we will add in the amount of space used by all block rsvs. However this also includes the global block rsv, which isn't flushable so shouldn't be accounted for in this calculation. If we decide to use ->bytes_may_use in our used calculation we need to subtract the global rsv size from this amount so it most closely matches the flushable space. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: David Sterba <dsterba@suse.com> |
|
|
|
1239e2da16 |
btrfs: use the global rsv size in the preemptive thresh calculation
We calculate the amount of "free" space available for normal reservations by taking the total space and subtracting out the hard used space, which is readonly, used, and reserved space. However we weren't taking into account the global block rsv, which is essentially hard used space. Handle this by subtracting it from the available free space, so that our threshold more closely mirrors reality. We need to do the check because it's possible that the global_rsv_size + used is > total_bytes, sometimes the global reserve can end up being calculated as larger than the available size (think small filesystems where we only have the original 8MiB chunk of metadata). It doesn't usually happen, but that can get us into trouble so this is safer. Signed-off-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: David Sterba <dsterba@suse.com> |
|
|
|
610a6ef44e |
btrfs: take into account global rsv in need_preemptive_reclaim
Global rsv can't be used for normal allocations, and for very full file systems we can decide to try and async flush constantly even though there's really not a lot of space to reclaim. Deal with this by including the global block rsv size in the "total used" calculation. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: David Sterba <dsterba@suse.com> |
|
|
|
0aae4ca9e9 |
btrfs: only clamp the first time we have to start flushing
We were clamping the threshold for preemptive reclaim any time we added a ticket to wait on, which if we have a lot of threads means we'd essentially max out the clamp the first time we start to flush. Instead of doing this, simply do it every time we have to start flushing, this will make us ramp up gradually instead of going to max clamping as soon as we start needing to do flushing. Signed-off-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: David Sterba <dsterba@suse.com> |
|
|
|
ed738ba7f9 |
btrfs: check worker before need_preemptive_reclaim
need_preemptive_reclaim() does some calculations, which aren't heavy, but if we're already running preemptive reclaim there's no reason to do them at all, so re-order the checks so that we don't do the calculation if we're already doing reclaim. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
|
|
|
94358c35d8 |
btrfs: remove stale comment for argument seed of btrfs_find_device
Commit
|
|
|
|
dc56219fe2 |
btrfs: correct try_lock_extent() usage in read_extent_buffer_subpage()
try_lock_extent() returns 1 on success or 0 for failure and not an error code. If try_lock_extent() fails, read_extent_buffer_subpage() returns zero indicating subpage extent read success. Return EAGAIN/EWOULDBLOCK if try_lock_extent() fails in locking the extent. Reviewed-by: Qu Wenruo <wqu@suse.com> Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
|
|
|
6fab154a33 |
for-5.13-rc6-tag
-----BEGIN PGP SIGNATURE----- iQIzBAABCgAdFiEE8rQSAMVO+zA4DBdWxWXV+ddtWDsFAmDNEFMACgkQxWXV+ddt WDuZQg/7BpGG3IDhxydM7fUrNT0xmW2/0VG8blXAgNTiaUO1zOrlrlDKm38+dtW6 yEv3ruf68tggrPNRCkyh51n45+ExqNwc7WwrxaKIRKmvYhYDsxnt8JLiNkv64isi R/CQVETX1cKsMuRhBuqmUq3Sy6VJZoi6coUHIC7ddBcLqnz8c9p7oGqzxBT8J9u3 1CkDSeLM4HKlISlVKhmT4lRG28cQTuy3mSABUt7N5ljJvrrpQAvEN1HCOE9XUQFQ wHH2DjNnBMvfB7mrGCBL7XGf8DF6ucgcyfofuOj6CQLFJ8bOnVKsk8dk/8XUQod+ rQoNIrVwW91LjmEO/I767JmjrRMtHbXvl3DEw3BvaD/O4efw78SN2VN+DRi4j7Xx aMtAWWfakfIyyJNZu9IEDa736iCdp+yl4bnq+hZpqmOYRqTq8n/zWuCMWZ5ohNay JyjxCm+xgo3vH9VEgzje6GDUki3I4Bwe7VlsaMr9F6F5GKzFp/4fb9lCewBrH6le +Y4gWxRT09plThsC2N3qmBQ9uVIJUyzmvcsYiMJ95tb24srdcPUTCG0C9zBvuMCC nm+1n5d3ENSEBaRNKQsC3MYcjKIh8VDEaKnntJrHAzHP41hrD+makrw3LVs6wLzu amGYz40XNq8zK2Xxv/N8O/U/PwQWKGj4bxq/2c1Wi9p9HACWfgk= =JbJO -----END PGP SIGNATURE----- Merge tag 'for-5.13-rc6-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux Pull btrfs fix from David Sterba: "One more fix, for a space accounting bug in zoned mode. It happens when a block group is switched back rw->ro and unusable bytes (due to zoned constraints) are subtracted twice. It has user visible effects so I consider it important enough for late -rc inclusion and backport to stable" * tag 'for-5.13-rc6-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux: btrfs: zoned: fix negative space_info->bytes_readonly |
|
|
|
f9f28e5bd0 |
btrfs: zoned: fix negative space_info->bytes_readonly
Consider we have a using block group on zoned btrfs.
|<- ZU ->|<- used ->|<---free--->|
`- Alloc offset
ZU: Zone unusable
Marking the block group read-only will migrate the zone unusable bytes
to the read-only bytes. So, we will have this.
|<- RO ->|<- used ->|<--- RO --->|
RO: Read only
When marking it back to read-write, btrfs_dec_block_group_ro()
subtracts the above "RO" bytes from the
space_info->bytes_readonly. And, it moves the zone unusable bytes back
and again subtracts those bytes from the space_info->bytes_readonly,
leading to negative bytes_readonly.
This can be observed in the output as eg.:
Data, single: total=512.00MiB, used=165.21MiB, zone_unusable=16.00EiB
Data, single: total=536870912, used=173256704, zone_unusable=18446744073603186688
This commit fixes the issue by reordering the operations.
Link: https://github.com/naota/linux/issues/37
Reported-by: David Sterba <dsterba@suse.com>
Fixes:
|
|
|
|
f0b65f39ac |
iov_iter: replace iov_iter_copy_from_user_atomic() with iterator-advancing variant
Replacement is called copy_page_from_iter_atomic(); unlike the old primitive the callers do *not* need to do iov_iter_advance() after it. In case when they end up consuming less than they'd been given they need to do iov_iter_revert() on everything they had not consumed. That, however, needs to be done only on slow paths. All in-tree callers converted. And that kills the last user of iterate_all_kinds() Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> |
|
|
|
cc6cf827dd |
for-5.13-rc5-tag
-----BEGIN PGP SIGNATURE----- iQIzBAABCgAdFiEE8rQSAMVO+zA4DBdWxWXV+ddtWDsFAmDAtXUACgkQxWXV+ddt WDtbdA//ccQ8JL5yC/x/j0ZXLJ2INqXpxIUPjadwwEjtTgOllvx+f1nU0QazeYfM XvvzDDvpemWajC2Ii54s2HCQbG+dAzO1YBl1XCyve91T0GeNGhzytZwM0pVxZePQ A+aOyVH7IcfFcmBy9T0yctqiGgtD3lre208kU9kolidsIyomLHxBckBhMYDXvJCK BOdrjq3f6H5J0zqOqAnWdc/Wc5z5pw3CHxlIuoA3Tp0Gv9TIx366Z/IvmFfCyvCt kYv2qnUaw10OlFLiqhetlZyv49ibW4waj0RbyY/rZx+69sE/PM4961NYAjLoFJc2 6OoZZO4OHWrNZpBJfbyyX9KVLspix075FID7qVhE/AVW4CYZGOFu5wJyXQiYlysH 1qqkihK3gbKEsB2429UeLZktupmx79LBIgg346+DSQYiMXMTGR8iZY1onbBM2wlf bep65hsiHhxoC6Z/KhxrTGZM2jyYW2nICw3o0xikhWv7MZPWKfKHrH9NJQ9Lpuhy gxut0ef9HbPXWP9PgRmY0Z8PsUi8RT1bv0bHVw7EnhLbi62neJLyxY3Q++W+7vBG LYeaxKWLTTJu73wpBQHLI0pD0UifXLrTkiCI+4gN8zVfzxUl+90mGz2AdSRRFI+U kNdX/haEHi00WBqYxWt33ae/FuSHjPuYXjiPQA7Kiy/C3n9GAB0= =mGAq -----END PGP SIGNATURE----- Merge tag 'for-5.13-rc5-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux Pull btrfs fixes from David Sterba: "A few more fixes that people hit during testing. Zoned mode fix: - fix 32bit value wrapping when calculating superblock offsets Error handling fixes: - properly check filesystema and device uuids - properly return errors when marking extents as written - do not write supers if we have an fs error" * tag 'for-5.13-rc5-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux: btrfs: promote debugging asserts to full-fledged checks in validate_super btrfs: return value from btrfs_mark_extent_written() in case of error btrfs: zoned: fix zone number to sector/physical calculation btrfs: do not write supers if we have an fs error |
|
|
|
aefd7f7065 |
btrfs: promote debugging asserts to full-fledged checks in validate_super
Syzbot managed to trigger this assert while performing its fuzzing. Turns out it's better to have those asserts turned into full-fledged checks so that in case buggy btrfs images are mounted the users gets an error and mounting is stopped. Alternatively with CONFIG_BTRFS_ASSERT disabled such image would have been erroneously allowed to be mounted. Reported-by: syzbot+a6bf271c02e4fe66b4e4@syzkaller.appspotmail.com CC: stable@vger.kernel.org # 5.4+ Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> [ add uuids to the messages ] Signed-off-by: David Sterba <dsterba@suse.com> |
|
|
|
e7b2ec3d3d |
btrfs: return value from btrfs_mark_extent_written() in case of error
We always return 0 even in case of an error in btrfs_mark_extent_written(). Fix it to return proper error value in case of a failure. All callers handle it. CC: stable@vger.kernel.org # 4.4+ Signed-off-by: Ritesh Harjani <riteshh@linux.ibm.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
|
|
|
5b434df877 |
btrfs: zoned: fix zone number to sector/physical calculation
In btrfs_get_dev_zone_info(), we have "u32 sb_zone" and calculate "sector_t
sector" by shifting it. But, this "sector" is calculated in 32bit, leading
it to be 0 for the 2nd superblock copy.
Since zone number is u32, shifting it to sector (sector_t) or physical
address (u64) can easily trigger a missing cast bug like this.
This commit introduces helpers to convert zone number to sector/LBA, so we
won't fall into the same pitfall again.
Reported-by: Dmitry Fomichev <Dmitry.Fomichev@wdc.com>
Fixes:
|
|
|
|
165ea85f14 |
btrfs: do not write supers if we have an fs error
Error injection testing uncovered a pretty severe problem where we could end up committing a super that pointed to the wrong tree roots, resulting in transid mismatch errors. The way we commit the transaction is we update the super copy with the current generations and bytenrs of the important roots, and then copy that into our super_for_commit. Then we allow transactions to continue again, we write out the dirty pages for the transaction, and then we write the super. If the write out fails we'll bail and skip writing the supers. However since we've allowed a new transaction to start, we can have a log attempting to sync at this point, which would be blocked on fs_info->tree_log_mutex. Once the commit fails we're allowed to do the log tree commit, which uses super_for_commit, which now points at fs tree's that were not written out. Fix this by checking BTRFS_FS_STATE_ERROR once we acquire the tree_log_mutex. This way if the transaction commit fails we're sure to see this bit set and we can skip writing the super out. This patch fixes this specific transid mismatch error I was seeing with this particular error path. CC: stable@vger.kernel.org # 5.12+ Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
|
|
|
fd2ff2774e |
for-5.13-rc4-tag
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEE8rQSAMVO+zA4DBdWxWXV+ddtWDsFAmC435cACgkQxWXV+ddt
WDuh5w/+IGfsUFfKikJZpZUP7q/2gC0t0dzZemxeZMutJbT/KCZCDd4CjLf6YH6r
oV9uYIgOWGd3aem9fe0R60ErJ4htgszIgeydCw3s2EuTms6WvAVA6Wp+wK/3UNx3
vQgYsqYkhMzIYKm/D4q8G+bqA2nPbBTDRNsXDIDrZYONxwSb+dNbQCGVknBRzRPa
hiCqYhUSyXA7E6UZdlma7MvpDOquZN+iW3RRVx1AULLqVs01PCnG/CEN+0oQm2JE
r9IyRxOZUvSeW6opT80yzZFCoboNSduMjPENTfzLY6Q1xzS/EtP4kM86fB/7AoJv
UI0c3Sr84SC9vOsBsbGJaBHpxP3OpzxohKU///jVQgEDpGv4STPlkVfxk23BHcux
Fdfg7wodkXeLU1Ff4dlJhvCqNYqc5V8lT5Kl52ai9Scct6D4yZBAq4KJp2LmYFC0
cHv6xFxBUv5zFZP1j6NMOmiLlCdDEkOruku2mMweQOBWYW/lHYNU469V5RCvfbLl
HlbDrtZdnQ3m2IhpQrXiTnT47Ib4DPYWkhRVfWbyVJHA+CbcOV62RQfl+r95Bc7j
FB1gM5vwUTJV7wgzErrq7+BD8quxG6/NuLDFjHYRcIj1kSIMK4/I1fOWruzuK+CL
6n7LLvBOojYfFo+ruQMSp2imDn3JJucBuh0/ssOlUWl2zsy6lDA=
=8066
-----END PGP SIGNATURE-----
Merge tag 'for-5.13-rc4-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux
Pull btrfs fixes from David Sterba:
"Error handling improvements, caught by error injection:
- handle errors during checksum deletion
- set error on mapping when ordered extent io cannot be finished
- inode link count fixup in tree-log
- missing return value checks for inode updates in tree-log
- abort transaction in rename exchange if adding second reference
fails
Fixes:
- fix fsync failure after writes to prealloc extents
- fix deadlock when cloning inline extents and low on available space
- fix compressed writes that cross stripe boundary"
* tag 'for-5.13-rc4-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
MAINTAINERS: add btrfs IRC link
btrfs: fix deadlock when cloning inline extents and low on available space
btrfs: fix fsync failure and transaction abort after writes to prealloc extents
btrfs: abort in rename_exchange if we fail to insert the second ref
btrfs: check error value from btrfs_update_inode in tree log
btrfs: fixup error handling in fixup_inode_link_counts
btrfs: mark ordered extent and inode with error if we fail to finish
btrfs: return errors from btrfs_del_csums in cleanup_ref_head
btrfs: fix error handling in btrfs_del_csums
btrfs: fix compressed writes that cross stripe boundary
|
|
|
|
a8698707a1 |
block: move bd_mutex to struct gendisk
Replace the per-block device bd_mutex with a per-gendisk open_mutex, thus simplifying locking wherever we deal with partitions. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Ming Lei <ming.lei@redhat.com> Acked-by: Roger Pau Monné <roger.pau@citrix.com> Link: https://lore.kernel.org/r/20210525061301.2242282-4-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk> |
|
|
|
76a6d5cd74 |
btrfs: fix deadlock when cloning inline extents and low on available space
There are a few cases where cloning an inline extent requires copying data
into a page of the destination inode. For these cases we are allocating
the required data and metadata space while holding a leaf locked. This can
result in a deadlock when we are low on available space because allocating
the space may flush delalloc and two deadlock scenarios can happen:
1) When starting writeback for an inode with a very small dirty range that
fits in an inline extent, we deadlock during the writeback when trying
to insert the inline extent, at cow_file_range_inline(), if the extent
is going to be located in the leaf for which we are already holding a
read lock;
2) After successfully starting writeback, for non-inline extent cases,
the async reclaim thread will hang waiting for an ordered extent to
complete if the ordered extent completion needs to modify the leaf
for which the clone task is holding a read lock (for adding or
replacing file extent items). So the cloning task will wait forever
on the async reclaim thread to make progress, which in turn is
waiting for the ordered extent completion which in turn is waiting
to acquire a write lock on the same leaf.
So fix this by making sure we release the path (and therefore the leaf)
every time we need to copy the inline extent's data into a page of the
destination inode, as by that time we do not need to have the leaf locked.
Fixes:
|
|
|
|
ea7036de0d |
btrfs: fix fsync failure and transaction abort after writes to prealloc extents
When doing a series of partial writes to different ranges of preallocated
extents with transaction commits and fsyncs in between, we can end up with
a checksum items in a log tree. This causes an fsync to fail with -EIO and
abort the transaction, turning the filesystem to RO mode, when syncing the
log.
For this to happen, we need to have a full fsync of a file following one
or more fast fsyncs.
The following example reproduces the problem and explains how it happens:
$ mkfs.btrfs -f /dev/sdc
$ mount /dev/sdc /mnt
# Create our test file with 2 preallocated extents. Leave a 1M hole
# between them to ensure that we get two file extent items that will
# never be merged into a single one. The extents are contiguous on disk,
# which will later result in the checksums for their data to be merged
# into a single checksum item in the csums btree.
#
$ xfs_io -f \
-c "falloc 0 1M" \
-c "falloc 3M 3M" \
/mnt/foobar
# Now write to the second extent and leave only 1M of it as unwritten,
# which corresponds to the file range [4M, 5M[.
#
# Then fsync the file to flush delalloc and to clear full sync flag from
# the inode, so that a future fsync will use the fast code path.
#
# After the writeback triggered by the fsync we have 3 file extent items
# that point to the second extent we previously allocated:
#
# 1) One file extent item of type BTRFS_FILE_EXTENT_REG that covers the
# file range [3M, 4M[
#
# 2) One file extent item of type BTRFS_FILE_EXTENT_PREALLOC that covers
# the file range [4M, 5M[
#
# 3) One file extent item of type BTRFS_FILE_EXTENT_REG that covers the
# file range [5M, 6M[
#
# All these file extent items have a generation of 6, which is the ID of
# the transaction where they were created. The split of the original file
# extent item is done at btrfs_mark_extent_written() when ordered extents
# complete for the file ranges [3M, 4M[ and [5M, 6M[.
#
$ xfs_io -c "pwrite -S 0xab 3M 1M" \
-c "pwrite -S 0xef 5M 1M" \
-c "fsync" \
/mnt/foobar
# Commit the current transaction. This wipes out the log tree created by
# the previous fsync.
sync
# Now write to the unwritten range of the second extent we allocated,
# corresponding to the file range [4M, 5M[, and fsync the file, which
# triggers the fast fsync code path.
#
# The fast fsync code path sees that there is a new extent map covering
# the file range [4M, 5M[ and therefore it will log a checksum item
# covering the range [1M, 2M[ of the second extent we allocated.
#
# Also, after the fsync finishes we no longer have the 3 file extent
# items that pointed to 3 sections of the second extent we allocated.
# Instead we end up with a single file extent item pointing to the whole
# extent, with a type of BTRFS_FILE_EXTENT_REG and a generation of 7 (the
# current transaction ID). This is due to the file extent item merging we
# do when completing ordered extents into ranges that point to unwritten
# (preallocated) extents. This merging is done at
# btrfs_mark_extent_written().
#
$ xfs_io -c "pwrite -S 0xcd 4M 1M" \
-c "fsync" \
/mnt/foobar
# Now do some write to our file outside the range of the second extent
# that we allocated with fallocate() and truncate the file size from 6M
# down to 5M.
#
# The truncate operation sets the full sync runtime flag on the inode,
# forcing the next fsync to use the slow code path. It also changes the
# length of the second file extent item so that it represents the file
# range [3M, 5M[ and not the range [3M, 6M[ anymore.
#
# Finally fsync the file. Since this is a fsync that triggers the slow
# code path, it will remove all items associated to the inode from the
# log tree and then it will scan for file extent items in the
# fs/subvolume tree that have a generation matching the current
# transaction ID, which is 7. This means it will log 2 file extent
# items:
#
# 1) One for the first extent we allocated, covering the file range
# [0, 1M[
#
# 2) Another for the first 2M of the second extent we allocated,
# covering the file range [3M, 5M[
#
# When logging the first file extent item we log a single checksum item
# that has all the checksums for the entire extent.
#
# When logging the second file extent item, we also lookup for the
# checksums that are associated with the range [0, 2M[ of the second
# extent we allocated (file range [3M, 5M[), and then we log them with
# btrfs_csum_file_blocks(). However that results in ending up with a log
# that has two checksum items with ranges that overlap:
#
# 1) One for the range [1M, 2M[ of the second extent we allocated,
# corresponding to the file range [4M, 5M[, which we logged in the
# previous fsync that used the fast code path;
#
# 2) One for the ranges [0, 1M[ and [0, 2M[ of the first and second
# extents, respectively, corresponding to the files ranges [0, 1M[
# and [3M, 5M[. This one was added during this last fsync that uses
# the slow code path and overlaps with the previous one logged by
# the previous fast fsync.
#
# This happens because when logging the checksums for the second
# extent, we notice they start at an offset that matches the end of the
# checksums item that we logged for the first extent, and because both
# extents are contiguous on disk, btrfs_csum_file_blocks() decides to
# extend that existing checksums item and append the checksums for the
# second extent to this item. The end result is we end up with two
# checksum items in the log tree that have overlapping ranges, as
# listed before, resulting in the fsync to fail with -EIO and aborting
# the transaction, turning the filesystem into RO mode.
#
$ xfs_io -c "pwrite -S 0xff 0 1M" \
-c "truncate 5M" \
-c "fsync" \
/mnt/foobar
fsync: Input/output error
After running the example, dmesg/syslog shows the tree checker complained
about the checksum items with overlapping ranges and we aborted the
transaction:
$ dmesg
(...)
[756289.557487] BTRFS critical (device sdc): corrupt leaf: root=18446744073709551610 block=30720000 slot=5, csum end range (16777216) goes beyond the start range (15728640) of the next csum item
[756289.560583] BTRFS info (device sdc): leaf 30720000 gen 7 total ptrs 7 free space 11677 owner 18446744073709551610
[756289.562435] BTRFS info (device sdc): refs 2 lock_owner 0 current 2303929
[756289.563654] item 0 key (257 1 0) itemoff 16123 itemsize 160
[756289.564649] inode generation 6 size 5242880 mode 100600
[756289.565636] item 1 key (257 12 256) itemoff 16107 itemsize 16
[756289.566694] item 2 key (257 108 0) itemoff 16054 itemsize 53
[756289.567725] extent data disk bytenr 13631488 nr 1048576
[756289.568697] extent data offset 0 nr 1048576 ram 1048576
[756289.569689] item 3 key (257 108 1048576) itemoff 16001 itemsize 53
[756289.570682] extent data disk bytenr 0 nr 0
[756289.571363] extent data offset 0 nr 2097152 ram 2097152
[756289.572213] item 4 key (257 108 3145728) itemoff 15948 itemsize 53
[756289.573246] extent data disk bytenr 14680064 nr 3145728
[756289.574121] extent data offset 0 nr 2097152 ram 3145728
[756289.574993] item 5 key (18446744073709551606 128 13631488) itemoff 12876 itemsize 3072
[756289.576113] item 6 key (18446744073709551606 128 15728640) itemoff 11852 itemsize 1024
[756289.577286] BTRFS error (device sdc): block=30720000 write time tree block corruption detected
[756289.578644] ------------[ cut here ]------------
[756289.579376] WARNING: CPU: 0 PID: 2303929 at fs/btrfs/disk-io.c:465 csum_one_extent_buffer+0xed/0x100 [btrfs]
[756289.580857] Modules linked in: btrfs dm_zero dm_dust loop dm_snapshot (...)
[756289.591534] CPU: 0 PID: 2303929 Comm: xfs_io Tainted: G W 5.12.0-rc8-btrfs-next-87 #1
[756289.592580] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.14.0-0-g155821a1990b-prebuilt.qemu.org 04/01/2014
[756289.594161] RIP: 0010:csum_one_extent_buffer+0xed/0x100 [btrfs]
[756289.595122] Code: 5d c3 e8 76 60 (...)
[756289.597509] RSP: 0018:ffffb51b416cb898 EFLAGS: 00010282
[756289.598142] RAX: 0000000000000000 RBX: fffff02b8a365bc0 RCX: 0000000000000000
[756289.598970] RDX: 0000000000000000 RSI: ffffffffa9112421 RDI: 00000000ffffffff
[756289.599798] RBP: ffffa06500880000 R08: 0000000000000000 R09: 0000000000000000
[756289.600619] R10: 0000000000000000 R11: 0000000000000001 R12: 0000000000000000
[756289.601456] R13: ffffa0652b1d8980 R14: ffffa06500880000 R15: 0000000000000000
[756289.602278] FS: 00007f08b23c9800(0000) GS:ffffa0682be00000(0000) knlGS:0000000000000000
[756289.603217] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[756289.603892] CR2: 00005652f32d0138 CR3: 000000025d616003 CR4: 0000000000370ef0
[756289.604725] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[756289.605563] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[756289.606400] Call Trace:
[756289.606704] btree_csum_one_bio+0x244/0x2b0 [btrfs]
[756289.607313] btrfs_submit_metadata_bio+0xb7/0x100 [btrfs]
[756289.608040] submit_one_bio+0x61/0x70 [btrfs]
[756289.608587] btree_write_cache_pages+0x587/0x610 [btrfs]
[756289.609258] ? free_debug_processing+0x1d5/0x240
[756289.609812] ? __module_address+0x28/0xf0
[756289.610298] ? lock_acquire+0x1a0/0x3e0
[756289.610754] ? lock_acquired+0x19f/0x430
[756289.611220] ? lock_acquire+0x1a0/0x3e0
[756289.611675] do_writepages+0x43/0xf0
[756289.612101] ? __filemap_fdatawrite_range+0xa4/0x100
[756289.612800] __filemap_fdatawrite_range+0xc5/0x100
[756289.613393] btrfs_write_marked_extents+0x68/0x160 [btrfs]
[756289.614085] btrfs_sync_log+0x21c/0xf20 [btrfs]
[756289.614661] ? finish_wait+0x90/0x90
[756289.615096] ? __mutex_unlock_slowpath+0x45/0x2a0
[756289.615661] ? btrfs_log_inode_parent+0x3c9/0xdc0 [btrfs]
[756289.616338] ? lock_acquire+0x1a0/0x3e0
[756289.616801] ? lock_acquired+0x19f/0x430
[756289.617284] ? lock_acquire+0x1a0/0x3e0
[756289.617750] ? lock_release+0x214/0x470
[756289.618221] ? lock_acquired+0x19f/0x430
[756289.618704] ? dput+0x20/0x4a0
[756289.619079] ? dput+0x20/0x4a0
[756289.619452] ? lockref_put_or_lock+0x9/0x30
[756289.619969] ? lock_release+0x214/0x470
[756289.620445] ? lock_release+0x214/0x470
[756289.620924] ? lock_release+0x214/0x470
[756289.621415] btrfs_sync_file+0x46a/0x5b0 [btrfs]
[756289.621982] do_fsync+0x38/0x70
[756289.622395] __x64_sys_fsync+0x10/0x20
[756289.622907] do_syscall_64+0x33/0x80
[756289.623438] entry_SYSCALL_64_after_hwframe+0x44/0xae
[756289.624063] RIP: 0033:0x7f08b27fbb7b
[756289.624588] Code: 0f 05 48 3d 00 (...)
[756289.626760] RSP: 002b:00007ffe2583f940 EFLAGS: 00000293 ORIG_RAX: 000000000000004a
[756289.627639] RAX: ffffffffffffffda RBX: 00005652f32cd0f0 RCX: 00007f08b27fbb7b
[756289.628464] RDX: 00005652f32cbca0 RSI: 00005652f32cd110 RDI: 0000000000000003
[756289.629323] RBP: 00005652f32cd110 R08: 0000000000000000 R09: 00007f08b28c4be0
[756289.630172] R10: fffffffffffff39a R11: 0000000000000293 R12: 0000000000000001
[756289.631007] R13: 00005652f32cd0f0 R14: 0000000000000001 R15: 00005652f32cc480
[756289.631819] irq event stamp: 0
[756289.632188] hardirqs last enabled at (0): [<0000000000000000>] 0x0
[756289.632911] hardirqs last disabled at (0): [<ffffffffa7e97c29>] copy_process+0x879/0x1cc0
[756289.633893] softirqs last enabled at (0): [<ffffffffa7e97c29>] copy_process+0x879/0x1cc0
[756289.634871] softirqs last disabled at (0): [<0000000000000000>] 0x0
[756289.635606] ---[ end trace 0a039fdc16ff3fef ]---
[756289.636179] BTRFS: error (device sdc) in btrfs_sync_log:3136: errno=-5 IO failure
[756289.637082] BTRFS info (device sdc): forced readonly
Having checksum items covering ranges that overlap is dangerous as in some
cases it can lead to having extent ranges for which we miss checksums
after log replay or getting the wrong checksum item. There were some fixes
in the past for bugs that resulted in this problem, and were explained and
fixed by the following commits:
|
|
|
|
dc09ef3562 |
btrfs: abort in rename_exchange if we fail to insert the second ref
Error injection stress uncovered a problem where we'd leave a dangling inode ref if we failed during a rename_exchange. This happens because we insert the inode ref for one side of the rename, and then for the other side. If this second inode ref insert fails we'll leave the first one dangling and leave a corrupt file system behind. Fix this by aborting if we did the insert for the first inode ref. CC: stable@vger.kernel.org # 4.9+ Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
|
|
|
f96d44743a |
btrfs: check error value from btrfs_update_inode in tree log
Error injection testing uncovered a case where we ended up with invalid link counts on an inode. This happened because we failed to notice an error when updating the inode while replaying the tree log, and committed the transaction with an invalid file system. Fix this by checking the return value of btrfs_update_inode. This resolved the link count errors I was seeing, and we already properly handle passing up the error values in these paths. CC: stable@vger.kernel.org # 4.4+ Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
|
|
|
011b28acf9 |
btrfs: fixup error handling in fixup_inode_link_counts
This function has the following pattern
while (1) {
ret = whatever();
if (ret)
goto out;
}
ret = 0
out:
return ret;
However several places in this while loop we simply break; when there's
a problem, thus clearing the return value, and in one case we do a
return -EIO, and leak the memory for the path.
Fix this by re-arranging the loop to deal with ret == 1 coming from
btrfs_search_slot, and then simply delete the
ret = 0;
out:
bit so everybody can break if there is an error, which will allow for
proper error handling to occur.
CC: stable@vger.kernel.org # 4.4+
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
|
|
d61bec08b9 |
btrfs: mark ordered extent and inode with error if we fail to finish
While doing error injection testing I saw that sometimes we'd get an abort that wouldn't stop the current transaction commit from completing. This abort was coming from finish ordered IO, but at this point in the transaction commit we should have gotten an error and stopped. It turns out the abort came from finish ordered io while trying to write out the free space cache. It occurred to me that any failure inside of finish_ordered_io isn't actually raised to the person doing the writing, so we could have any number of failures in this path and think the ordered extent completed successfully and the inode was fine. Fix this by marking the ordered extent with BTRFS_ORDERED_IOERR, and marking the mapping of the inode with mapping_set_error, so any callers that simply call fdatawait will also get the error. With this we're seeing the IO error on the free space inode when we fail to do the finish_ordered_io. CC: stable@vger.kernel.org # 4.19+ Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
|
|
|
856bd270dc |
btrfs: return errors from btrfs_del_csums in cleanup_ref_head
We are unconditionally returning 0 in cleanup_ref_head, despite the fact that btrfs_del_csums could fail. We need to return the error so the transaction gets aborted properly, fix this by returning ret from btrfs_del_csums in cleanup_ref_head. Reviewed-by: Qu Wenruo <wqu@suse.com> CC: stable@vger.kernel.org # 4.19+ Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
|
|
|
b86652be7c |
btrfs: fix error handling in btrfs_del_csums
Error injection stress would sometimes fail with checksums on disk that
did not have a corresponding extent. This occurred because the pattern
in btrfs_del_csums was
while (1) {
ret = btrfs_search_slot();
if (ret < 0)
break;
}
ret = 0;
out:
btrfs_free_path(path);
return ret;
If we got an error from btrfs_search_slot we'd clear the error because
we were breaking instead of goto out. Instead of using goto out, simply
handle the cases where we may leave a random value in ret, and get rid
of the
ret = 0;
out:
pattern and simply allow break to have the proper error reporting. With
this fix we properly abort the transaction and do not commit thinking we
successfully deleted the csum.
Reviewed-by: Qu Wenruo <wqu@suse.com>
CC: stable@vger.kernel.org # 4.4+
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
|
|
4c80a97d7b |
btrfs: fix compressed writes that cross stripe boundary
[BUG] When running btrfs/027 with "-o compress" mount option, it always crashes with the following call trace: BTRFS critical (device dm-4): mapping failed logical 298901504 bio len 12288 len 8192 ------------[ cut here ]------------ kernel BUG at fs/btrfs/volumes.c:6651! invalid opcode: 0000 [#1] PREEMPT SMP NOPTI CPU: 5 PID: 31089 Comm: kworker/u24:10 Tainted: G OE 5.13.0-rc2-custom+ #26 Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 0.0.0 02/06/2015 Workqueue: btrfs-delalloc btrfs_work_helper [btrfs] RIP: 0010:btrfs_map_bio.cold+0x58/0x5a [btrfs] Call Trace: btrfs_submit_compressed_write+0x2d7/0x470 [btrfs] submit_compressed_extents+0x3b0/0x470 [btrfs] ? mark_held_locks+0x49/0x70 btrfs_work_helper+0x131/0x3e0 [btrfs] process_one_work+0x28f/0x5d0 worker_thread+0x55/0x3c0 ? process_one_work+0x5d0/0x5d0 kthread+0x141/0x160 ? __kthread_bind_mask+0x60/0x60 ret_from_fork+0x22/0x30 ---[ end trace 63113a3a91f34e68 ]--- [CAUSE] The critical message before the crash means we have a bio at logical bytenr 298901504 length 12288, but only 8192 bytes can fit into one stripe, the remaining 4096 bytes go to another stripe. In btrfs, all bios are properly split to avoid cross stripe boundary, but commit |
|
|
|
45af60e7ce |
for-5.13-rc2-tag
-----BEGIN PGP SIGNATURE----- iQIzBAABCgAdFiEE8rQSAMVO+zA4DBdWxWXV+ddtWDsFAmCoEQkACgkQxWXV+ddt WDsn6Q//XXQVextL6g6Wjx0SR9b5C1ndSV841jNY+KQ0drBPSOBs+0SXI+nIWAK1 iTpmj3s2qrRElZZ6DT4fKP28KnbUJed9+CcirNnN3IMOeauI760CLobXZLsw1wGH o0HKKgcPhw/v9o9jqX22rSfzDZ2Rx2KhZ8iEb1ZXIG5iJNFcnXCCoFOqk4I+UEvH /5734KU8RI3sCRhziSf/vDCF50p+BIWr8VilQkmZUzi0oa6Y1wXm0qd9j0unhICR NxcBk1NYdOosAvVRhSqync1BNLhXSctg4rwhLlSI5SDvt/Ivz5tguNr9HcizOvmW zyb0g1c3Pq0p2wQJLybbs1zn67d0+7Q23UPWx1C+IKU3nmX5mGWzToxjVOQASYaZ 8UbzYAjUHtJpLDB4dp6+k5Pv/yfVGyhxXI+qLMWow77qRPPf7/vw5nEwTXmjcPRH 9st0TopZVXI4IEpZP+HeNFdNONuPL3CqV0t1+MnC73WMhmUfXR5E8Yq5H3MscuFl smkrWUq/g+cmkiOw5r4MyadFuN1MsXGw4rOdbYjY4JqVht6gPkOp3P73Hme5rD3H Txw/1WKEl+w3I6wS0Dl/NFcMGOyl8gEv4rATDyRWkxfmCue2mcTGS/3jjjWWguu4 +Q7e6p1390PLAvMV/rEDoYmFCoPSYp6trvupW+5fkZdOyei1SZM= =98LW -----END PGP SIGNATURE----- Merge tag 'for-5.13-rc2-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux Pull btrfs fixes from David Sterba: "A few more fixes: - fix unaligned compressed writes in zoned mode - fix false positive lockdep warning when cloning inline extent - remove wrong BUG_ON in tree-log error handling" * tag 'for-5.13-rc2-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux: btrfs: zoned: fix parallel compressed writes btrfs: zoned: pass start block to btrfs_use_zone_append btrfs: do not BUG_ON in link_to_fixup_dir btrfs: release path before starting transaction when cloning inline extent |
|
|
|
764c7c9a46 |
btrfs: zoned: fix parallel compressed writes
When multiple processes write data to the same block group on a
compressed zoned filesystem, the underlying device could report I/O
errors and data corruption is possible.
This happens because on a zoned file system, compressed data writes
where sent to the device via a REQ_OP_WRITE instead of a
REQ_OP_ZONE_APPEND operation. But with REQ_OP_WRITE and parallel
submission it cannot be guaranteed that the data is always submitted
aligned to the underlying zone's write pointer.
The change to using REQ_OP_ZONE_APPEND instead of REQ_OP_WRITE on a
zoned filesystem is non intrusive on a regular file system or when
submitting to a conventional zone on a zoned filesystem, as it is
guarded by btrfs_use_zone_append.
Reported-by: David Sterba <dsterba@suse.com>
Fixes:
|
|
|
|
e380adfc21 |
btrfs: zoned: pass start block to btrfs_use_zone_append
btrfs_use_zone_append only needs the passed in extent_map's block_start member, so there's no need to pass in the full extent map. This also enables the use of btrfs_use_zone_append in places where we only have a start byte but no extent_map. Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
|
|
|
8ac91e6c60 |
for-5.13-rc2-tag
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEE8rQSAMVO+zA4DBdWxWXV+ddtWDsFAmCibywACgkQxWXV+ddt
WDs8QhAAlJ1INZGF01lP2mUhzesVIctIAPGBf/77Zsxmcu0rA6E66RVVsYMgGU54
+FWd+LwuFCtC1364OnDa2DnmYtvHfgR4If7EGowpk3qzZFeZQSLqayOFa5tZLYPG
tJStjY32QTerfZRoxPJ1QPcoWjxNMxYqYw/s68G3tTTSHEYtlH9zNHbLm9ny507x
uPHpxqKXdv3/LYHLt6XUypFqsZkMoDW98oOKvo0MZE/fjcqiDcrvAoYe+y8raFC3
FztlfA2TBmmp/PouDXLCspXAksLpVo9mgTQ0kW4K7152cC0X/zWXYNH01uQ+qTAS
OFNKt2DSRIq5TR56ZmReYvRgq0FNMotYpRpxoePSF/rwL+wnsTl7QI3r/d/h/uxQ
IzBeBv1Wd+1ZJcqnmEGx8Mws3nGswKyl4W65x8yin41djVoHgM4tYu3nGqielu+w
ifEBmU5tUGo05z2HA5kpLjDzc6MwWaCIduQvjH/I4Vgo9fhDo6pQO2dyPC50Nkk5
DQ5jfxiXJ/ZSh5NbWtIkB/OQuwkVL1nDy2jtj3qnK06HDKstK1zui5nccFKFNOiX
wtYjnGqd3+vIGIZniMuu9rbPLtG4CCerq44v1gyS6LSEycNW9/r2cOXRaiQk5pej
CoYMdnmAqzwidtn4FZPRNQ7JgyckKCXQQSGCazN2vvLCXisCUrw=
=ue6o
-----END PGP SIGNATURE-----
Merge tag 'for-5.13-rc2-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux
Pull btrfs fixes from David Sterba:
"A few more fixes:
- fix fiemap to print extents that could get misreported due to
internal extent splitting and logical merging for fiemap output
- fix RCU stalls during delayed iputs
- fix removed dentries still existing after log is synced"
* tag 'for-5.13-rc2-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
btrfs: fix removed dentries still existing after log is synced
btrfs: return whole extents in fiemap
btrfs: avoid RCU stalls while running delayed iputs
btrfs: return 0 for dev_extent_hole_check_zoned hole_start in case of error
|
|
|
|
91df99a6eb |
btrfs: do not BUG_ON in link_to_fixup_dir
While doing error injection testing I got the following panic kernel BUG at fs/btrfs/tree-log.c:1862! invalid opcode: 0000 [#1] SMP NOPTI CPU: 1 PID: 7836 Comm: mount Not tainted 5.13.0-rc1+ #305 Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.13.0-2.fc32 04/01/2014 RIP: 0010:link_to_fixup_dir+0xd5/0xe0 RSP: 0018:ffffb5800180fa30 EFLAGS: 00010216 RAX: fffffffffffffffb RBX: 00000000fffffffb RCX: ffff8f595287faf0 RDX: ffffb5800180fa37 RSI: ffff8f5954978800 RDI: 0000000000000000 RBP: ffff8f5953af9450 R08: 0000000000000019 R09: 0000000000000001 R10: 000151f408682970 R11: 0000000120021001 R12: ffff8f5954978800 R13: ffff8f595287faf0 R14: ffff8f5953c77dd0 R15: 0000000000000065 FS: 00007fc5284c8c40(0000) GS:ffff8f59bbd00000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 00007fc5287f47c0 CR3: 000000011275e002 CR4: 0000000000370ee0 Call Trace: replay_one_buffer+0x409/0x470 ? btree_read_extent_buffer_pages+0xd0/0x110 walk_up_log_tree+0x157/0x1e0 walk_log_tree+0xa6/0x1d0 btrfs_recover_log_trees+0x1da/0x360 ? replay_one_extent+0x7b0/0x7b0 open_ctree+0x1486/0x1720 btrfs_mount_root.cold+0x12/0xea ? __kmalloc_track_caller+0x12f/0x240 legacy_get_tree+0x24/0x40 vfs_get_tree+0x22/0xb0 vfs_kern_mount.part.0+0x71/0xb0 btrfs_mount+0x10d/0x380 ? vfs_parse_fs_string+0x4d/0x90 legacy_get_tree+0x24/0x40 vfs_get_tree+0x22/0xb0 path_mount+0x433/0xa10 __x64_sys_mount+0xe3/0x120 do_syscall_64+0x3d/0x80 entry_SYSCALL_64_after_hwframe+0x44/0xae We can get -EIO or any number of legitimate errors from btrfs_search_slot(), panicing here is not the appropriate response. The error path for this code handles errors properly, simply return the error. Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
|
|
|
6416954ca7 |
btrfs: release path before starting transaction when cloning inline extent
When cloning an inline extent there are a few cases, such as when we have an implicit hole at file offset 0, where we start a transaction while holding a read lock on a leaf. Starting the transaction results in a call to sb_start_intwrite(), which results in doing a read lock on a percpu semaphore. Lockdep doesn't like this and complains about it: [46.580704] ====================================================== [46.580752] WARNING: possible circular locking dependency detected [46.580799] 5.13.0-rc1 #28 Not tainted [46.580832] ------------------------------------------------------ [46.580877] cloner/3835 is trying to acquire lock: [46.580918] c00000001301d638 (sb_internal#2){.+.+}-{0:0}, at: clone_copy_inline_extent+0xe4/0x5a0 [46.581167] [46.581167] but task is already holding lock: [46.581217] c000000007fa2550 (btrfs-tree-00){++++}-{3:3}, at: __btrfs_tree_read_lock+0x70/0x1d0 [46.581293] [46.581293] which lock already depends on the new lock. [46.581293] [46.581351] [46.581351] the existing dependency chain (in reverse order) is: [46.581410] [46.581410] -> #1 (btrfs-tree-00){++++}-{3:3}: [46.581464] down_read_nested+0x68/0x200 [46.581536] __btrfs_tree_read_lock+0x70/0x1d0 [46.581577] btrfs_read_lock_root_node+0x88/0x200 [46.581623] btrfs_search_slot+0x298/0xb70 [46.581665] btrfs_set_inode_index+0xfc/0x260 [46.581708] btrfs_new_inode+0x26c/0x950 [46.581749] btrfs_create+0xf4/0x2b0 [46.581782] lookup_open.isra.57+0x55c/0x6a0 [46.581855] path_openat+0x418/0xd20 [46.581888] do_filp_open+0x9c/0x130 [46.581920] do_sys_openat2+0x2ec/0x430 [46.581961] do_sys_open+0x90/0xc0 [46.581993] system_call_exception+0x3d4/0x410 [46.582037] system_call_common+0xec/0x278 [46.582078] [46.582078] -> #0 (sb_internal#2){.+.+}-{0:0}: [46.582135] __lock_acquire+0x1e90/0x2c50 [46.582176] lock_acquire+0x2b4/0x5b0 [46.582263] start_transaction+0x3cc/0x950 [46.582308] clone_copy_inline_extent+0xe4/0x5a0 [46.582353] btrfs_clone+0x5fc/0x880 [46.582388] btrfs_clone_files+0xd8/0x1c0 [46.582434] btrfs_remap_file_range+0x3d8/0x590 [46.582481] do_clone_file_range+0x10c/0x270 [46.582558] vfs_clone_file_range+0x1b0/0x310 [46.582605] ioctl_file_clone+0x90/0x130 [46.582651] do_vfs_ioctl+0x874/0x1ac0 [46.582697] sys_ioctl+0x6c/0x120 [46.582733] system_call_exception+0x3d4/0x410 [46.582777] system_call_common+0xec/0x278 [46.582822] [46.582822] other info that might help us debug this: [46.582822] [46.582888] Possible unsafe locking scenario: [46.582888] [46.582942] CPU0 CPU1 [46.582984] ---- ---- [46.583028] lock(btrfs-tree-00); [46.583062] lock(sb_internal#2); [46.583119] lock(btrfs-tree-00); [46.583174] lock(sb_internal#2); [46.583212] [46.583212] *** DEADLOCK *** [46.583212] [46.583266] 6 locks held by cloner/3835: [46.583299] #0: c00000001301d448 (sb_writers#12){.+.+}-{0:0}, at: ioctl_file_clone+0x90/0x130 [46.583382] #1: c00000000f6d3768 (&sb->s_type->i_mutex_key#15){+.+.}-{3:3}, at: lock_two_nondirectories+0x58/0xc0 [46.583477] #2: c00000000f6d72a8 (&sb->s_type->i_mutex_key#15/4){+.+.}-{3:3}, at: lock_two_nondirectories+0x9c/0xc0 [46.583574] #3: c00000000f6d7138 (&ei->i_mmap_lock){+.+.}-{3:3}, at: btrfs_remap_file_range+0xd0/0x590 [46.583657] #4: c00000000f6d35f8 (&ei->i_mmap_lock/1){+.+.}-{3:3}, at: btrfs_remap_file_range+0xe0/0x590 [46.583743] #5: c000000007fa2550 (btrfs-tree-00){++++}-{3:3}, at: __btrfs_tree_read_lock+0x70/0x1d0 [46.583828] [46.583828] stack backtrace: [46.583872] CPU: 1 PID: 3835 Comm: cloner Not tainted 5.13.0-rc1 #28 [46.583931] Call Trace: [46.583955] [c0000000167c7200] [c000000000c1ee78] dump_stack+0xec/0x144 (unreliable) [46.584052] [c0000000167c7240] [c000000000274058] print_circular_bug.isra.32+0x3a8/0x400 [46.584123] [c0000000167c72e0] [c0000000002741f4] check_noncircular+0x144/0x190 [46.584191] [c0000000167c73b0] [c000000000278fc0] __lock_acquire+0x1e90/0x2c50 [46.584259] [c0000000167c74f0] [c00000000027aa94] lock_acquire+0x2b4/0x5b0 [46.584317] [c0000000167c75e0] [c000000000a0d6cc] start_transaction+0x3cc/0x950 [46.584388] [c0000000167c7690] [c000000000af47a4] clone_copy_inline_extent+0xe4/0x5a0 [46.584457] [c0000000167c77c0] [c000000000af525c] btrfs_clone+0x5fc/0x880 [46.584514] [c0000000167c7990] [c000000000af5698] btrfs_clone_files+0xd8/0x1c0 [46.584583] [c0000000167c7a00] [c000000000af5b58] btrfs_remap_file_range+0x3d8/0x590 [46.584652] [c0000000167c7ae0] [c0000000005d81dc] do_clone_file_range+0x10c/0x270 [46.584722] [c0000000167c7b40] [c0000000005d84f0] vfs_clone_file_range+0x1b0/0x310 [46.584793] [c0000000167c7bb0] [c00000000058bf80] ioctl_file_clone+0x90/0x130 [46.584861] [c0000000167c7c10] [c00000000058c894] do_vfs_ioctl+0x874/0x1ac0 [46.584922] [c0000000167c7d10] [c00000000058db4c] sys_ioctl+0x6c/0x120 [46.584978] [c0000000167c7d60] [c0000000000364a4] system_call_exception+0x3d4/0x410 [46.585046] [c0000000167c7e10] [c00000000000d45c] system_call_common+0xec/0x278 [46.585114] --- interrupt: c00 at 0x7ffff7e22990 [46.585160] NIP: 00007ffff7e22990 LR: 00000001000010ec CTR: 0000000000000000 [46.585224] REGS: c0000000167c7e80 TRAP: 0c00 Not tainted (5.13.0-rc1) [46.585280] MSR: 800000000280f033 <SF,VEC,VSX,EE,PR,FP,ME,IR,DR,RI,LE> CR: 28000244 XER: 00000000 [46.585374] IRQMASK: 0 [46.585374] GPR00: 0000000000000036 00007fffffffdec0 00007ffff7f17100 0000000000000004 [46.585374] GPR04: 000000008020940d 00007fffffffdf40 0000000000000000 0000000000000000 [46.585374] GPR08: 0000000000000004 0000000000000000 0000000000000000 0000000000000000 [46.585374] GPR12: 0000000000000000 00007ffff7ffa940 0000000000000000 0000000000000000 [46.585374] GPR16: 0000000000000000 0000000000000000 0000000000000000 0000000000000000 [46.585374] GPR20: 0000000000000000 000000009123683e 00007fffffffdf40 0000000000000000 [46.585374] GPR24: 0000000000000000 0000000000000000 0000000000000000 0000000000000004 [46.585374] GPR28: 0000000100030260 0000000100030280 0000000000000003 000000000000005f [46.585919] NIP [00007ffff7e22990] 0x7ffff7e22990 [46.585964] LR [00000001000010ec] 0x1000010ec [46.586010] --- interrupt: c00 This should be a false positive, as both locks are acquired in read mode. Nevertheless, we don't need to hold a leaf locked when we start the transaction, so just release the leaf (path) before starting it. Reported-by: Ritesh Harjani <riteshh@linux.ibm.com> Link: https://lore.kernel.org/linux-btrfs/20210513214404.xks77p566fglzgum@riteshh-domain/ Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
|
|
|
54a40fc3a1 |
btrfs: fix removed dentries still existing after log is synced
When we move one inode from one directory to another and both the inode
and its previous parent directory were logged before, we are not supposed
to have the dentry for the old parent if we have a power failure after the
log is synced. Only the new dentry is supposed to exist.
Generally this works correctly, however there is a scenario where this is
not currently working, because the old parent of the file/directory that
was moved is not authoritative for a range that includes the dir index and
dir item keys of the old dentry. This case is better explained with the
following example and reproducer:
# The test requires a very specific layout of keys and items in the
# fs/subvolume btree to trigger the bug. So we want to make sure that
# on whatever platform we are, we have the same leaf/node size.
#
# Currently in btrfs the node/leaf size can not be smaller than the page
# size (but it can be greater than the page size). So use the largest
# supported node/leaf size (64K).
$ mkfs.btrfs -f -n 65536 /dev/sdc
$ mount /dev/sdc /mnt
# "testdir" is inode 257.
$ mkdir /mnt/testdir
$ chmod 755 /mnt/testdir
# Create several empty files to have the directory "testdir" with its
# items spread over several leaves (7 in this case).
$ for ((i = 1; i <= 1200; i++)); do
echo -n > /mnt/testdir/file$i
done
# Create our test directory "dira", inode number 1458, which gets all
# its items in leaf 7.
#
# The BTRFS_DIR_ITEM_KEY item for inode 257 ("testdir") that points to
# the entry named "dira" is in leaf 2, while the BTRFS_DIR_INDEX_KEY
# item that points to that entry is in leaf 3.
#
# For this particular filesystem node size (64K), file count and file
# names, we endup with the directory entry items from inode 257 in
# leaves 2 and 3, as previously mentioned - what matters for triggering
# the bug exercised by this test case is that those items are not placed
# in leaf 1, they must be placed in a leaf different from the one
# containing the inode item for inode 257.
#
# The corresponding BTRFS_DIR_ITEM_KEY and BTRFS_DIR_INDEX_KEY items for
# the parent inode (257) are the following:
#
# item 460 key (257 DIR_ITEM 3724298081) itemoff 48344 itemsize 34
# location key (1458 INODE_ITEM 0) type DIR
# transid 6 data_len 0 name_len 4
# name: dira
#
# and:
#
# item 771 key (257 DIR_INDEX 1202) itemoff 36673 itemsize 34
# location key (1458 INODE_ITEM 0) type DIR
# transid 6 data_len 0 name_len 4
# name: dira
$ mkdir /mnt/testdir/dira
# Make sure everything done so far is durably persisted.
$ sync
# Now do a change to inode 257 ("testdir") that does not result in
# COWing leaves 2 and 3 - the leaves that contain the directory items
# pointing to inode 1458 (directory "dira").
#
# Changing permissions, the owner/group, updating or adding a xattr,
# etc, will not change (COW) leaves 2 and 3. So for the sake of
# simplicity change the permissions of inode 257, which results in
# updating its inode item and therefore change (COW) only leaf 1.
$ chmod 700 /mnt/testdir
# Now fsync directory inode 257.
#
# Since only the first leaf was changed/COWed, we log the inode item of
# inode 257 and only the dentries found in the first leaf, all have a
# key type of BTRFS_DIR_ITEM_KEY, and no keys of type
# BTRFS_DIR_INDEX_KEY, because they sort after the former type and none
# exist in the first leaf.
#
# We also log 3 items that represent ranges for dir items and dir
# indexes for which the log is authoritative:
#
# 1) a key of type BTRFS_DIR_LOG_ITEM_KEY, which indicates the log is
# authoritative for all BTRFS_DIR_ITEM_KEY keys that have an offset
# in the range [0, 2285968570] (the offset here is the crc32c of the
# dentry's name). The value 2285968570 corresponds to the offset of
# the first key of leaf 2 (which is of type BTRFS_DIR_ITEM_KEY);
#
# 2) a key of type BTRFS_DIR_LOG_ITEM_KEY, which indicates the log is
# authoritative for all BTRFS_DIR_ITEM_KEY keys that have an offset
# in the range [4293818216, (u64)-1] (the offset here is the crc32c
# of the dentry's name). The value 4293818216 corresponds to the
# offset of the highest key of type BTRFS_DIR_ITEM_KEY plus 1
# (4293818215 + 1), which is located in leaf 2;
#
# 3) a key of type BTRFS_DIR_LOG_INDEX_KEY, with an offset of 1203,
# which indicates the log is authoritative for all keys of type
# BTRFS_DIR_INDEX_KEY that have an offset in the range
# [1203, (u64)-1]. The value 1203 corresponds to the offset of the
# last key of type BTRFS_DIR_INDEX_KEY plus 1 (1202 + 1), which is
# located in leaf 3;
#
# Also, because "testdir" is a directory and inode 1458 ("dira") is a
# child directory, we log inode 1458 too.
$ xfs_io -c "fsync" /mnt/testdir
# Now move "dira", inode 1458, to be a child of the root directory
# (inode 256).
#
# Because this inode was previously logged, when "testdir" was fsynced,
# the log is updated so that the old inode reference, referring to inode
# 257 as the parent, is deleted and the new inode reference, referring
# to inode 256 as the parent, is added to the log.
$ mv /mnt/testdir/dira /mnt
# Now change some file and fsync it. This guarantees the log changes
# made by the previous move/rename operation are persisted. We do not
# need to do any special modification to the file, just any change to
# any file and sync the log.
$ xfs_io -c "pwrite -S 0xab 0 64K" -c "fsync" /mnt/testdir/file1
# Simulate a power failure and then mount again the filesystem to
# replay the log tree. We want to verify that we are able to mount the
# filesystem, meaning log replay was successful, and that directory
# inode 1458 ("dira") only has inode 256 (the filesystem's root) as
# its parent (and no longer a child of inode 257).
#
# It used to happen that during log replay we would end up having
# inode 1458 (directory "dira") with 2 hard links, being a child of
# inode 257 ("testdir") and inode 256 (the filesystem's root). This
# resulted in the tree checker detecting the issue and causing the
# mount operation to fail (with -EIO).
#
# This happened because in the log we have the new name/parent for
# inode 1458, which results in adding the new dentry with inode 256
# as the parent, but the previous dentry, under inode 257 was never
# removed - this is because the ranges for dir items and dir indexes
# of inode 257 for which the log is authoritative do not include the
# old dir item and dir index for the dentry of inode 257 referring to
# inode 1458:
#
# - for dir items, the log is authoritative for the ranges
# [0, 2285968570] and [4293818216, (u64)-1]. The dir item at inode 257
# pointing to inode 1458 has a key of (257 DIR_ITEM 3724298081), as
# previously mentioned, so the dir item is not deleted when the log
# replay procedure processes the authoritative ranges, as 3724298081
# is outside both ranges;
#
# - for dir indexes, the log is authoritative for the range
# [1203, (u64)-1], and the dir index item of inode 257 pointing to
# inode 1458 has a key of (257 DIR_INDEX 1202), as previously
# mentioned, so the dir index item is not deleted when the log
# replay procedure processes the authoritative range.
<power failure>
$ mount /dev/sdc /mnt
mount: /mnt: can't read superblock on /dev/sdc.
$ dmesg
(...)
[87849.840509] BTRFS info (device sdc): start tree-log replay
[87849.875719] BTRFS critical (device sdc): corrupt leaf: root=5 block=30539776 slot=554 ino=1458, invalid nlink: has 2 expect no more than 1 for dir
[87849.878084] BTRFS info (device sdc): leaf 30539776 gen 7 total ptrs 557 free space 2092 owner 5
[87849.879516] BTRFS info (device sdc): refs 1 lock_owner 0 current 2099108
[87849.880613] item 0 key (1181 1 0) itemoff 65275 itemsize 160
[87849.881544] inode generation 6 size 0 mode 100644
[87849.882692] item 1 key (1181 12 257) itemoff 65258 itemsize 17
(...)
[87850.562549] item 556 key (1458 12 257) itemoff 16017 itemsize 14
[87850.563349] BTRFS error (device dm-0): block=30539776 write time tree block corruption detected
[87850.564386] ------------[ cut here ]------------
[87850.564920] WARNING: CPU: 3 PID: 2099108 at fs/btrfs/disk-io.c:465 csum_one_extent_buffer+0xed/0x100 [btrfs]
[87850.566129] Modules linked in: btrfs dm_zero dm_snapshot (...)
[87850.573789] CPU: 3 PID: 2099108 Comm: mount Not tainted 5.12.0-rc8-btrfs-next-86 #1
(...)
[87850.587481] Call Trace:
[87850.587768] btree_csum_one_bio+0x244/0x2b0 [btrfs]
[87850.588354] ? btrfs_bio_fits_in_stripe+0xd8/0x110 [btrfs]
[87850.589003] btrfs_submit_metadata_bio+0xb7/0x100 [btrfs]
[87850.589654] submit_one_bio+0x61/0x70 [btrfs]
[87850.590248] submit_extent_page+0x91/0x2f0 [btrfs]
[87850.590842] write_one_eb+0x175/0x440 [btrfs]
[87850.591370] ? find_extent_buffer_nolock+0x1c0/0x1c0 [btrfs]
[87850.592036] btree_write_cache_pages+0x1e6/0x610 [btrfs]
[87850.592665] ? free_debug_processing+0x1d5/0x240
[87850.593209] do_writepages+0x43/0xf0
[87850.593798] ? __filemap_fdatawrite_range+0xa4/0x100
[87850.594391] __filemap_fdatawrite_range+0xc5/0x100
[87850.595196] btrfs_write_marked_extents+0x68/0x160 [btrfs]
[87850.596202] btrfs_write_and_wait_transaction.isra.0+0x4d/0xd0 [btrfs]
[87850.597377] btrfs_commit_transaction+0x794/0xca0 [btrfs]
[87850.598455] ? _raw_spin_unlock_irqrestore+0x32/0x60
[87850.599305] ? kmem_cache_free+0x15a/0x3d0
[87850.600029] btrfs_recover_log_trees+0x346/0x380 [btrfs]
[87850.601021] ? replay_one_extent+0x7d0/0x7d0 [btrfs]
[87850.601988] open_ctree+0x13c9/0x1698 [btrfs]
[87850.602846] btrfs_mount_root.cold+0x13/0xed [btrfs]
[87850.603771] ? kmem_cache_alloc_trace+0x7c9/0x930
[87850.604576] ? vfs_parse_fs_string+0x5d/0xb0
[87850.605293] ? kfree+0x276/0x3f0
[87850.605857] legacy_get_tree+0x30/0x50
[87850.606540] vfs_get_tree+0x28/0xc0
[87850.607163] fc_mount+0xe/0x40
[87850.607695] vfs_kern_mount.part.0+0x71/0x90
[87850.608440] btrfs_mount+0x13b/0x3e0 [btrfs]
(...)
[87850.629477] ---[ end trace 68802022b99a1ea0 ]---
[87850.630849] BTRFS: error (device sdc) in btrfs_commit_transaction:2381: errno=-5 IO failure (Error while writing out transaction)
[87850.632422] BTRFS warning (device sdc): Skipping commit of aborted transaction.
[87850.633416] BTRFS: error (device sdc) in cleanup_transaction:1978: errno=-5 IO failure
[87850.634553] BTRFS: error (device sdc) in btrfs_replay_log:2431: errno=-5 IO failure (Failed to recover log tree)
[87850.637529] BTRFS error (device sdc): open_ctree failed
In this example the inode we moved was a directory, so it was easy to
detect the problem because directories can only have one hard link and
the tree checker immediately detects that. If the moved inode was a file,
then the log replay would succeed and we would end up having both the
new hard link (/mnt/foo) and the old hard link (/mnt/testdir/foo) present,
but only the new one should be present.
Fix this by forcing re-logging of the old parent directory when logging
the new name during a rename operation. This ensures we end up with a log
that is authoritative for a range covering the keys for the old dentry,
therefore causing the old dentry do be deleted when replaying the log.
A test case for fstests will follow up soon.
Fixes:
|
|
|
|
15c7745c9a |
btrfs: return whole extents in fiemap
`xfs_io -c 'fiemap <off> <len>' <file>`
can give surprising results on btrfs that differ from xfs.
btrfs prints out extents trimmed to fit the user input. If the user's
fiemap request has an offset, then rather than returning each whole
extent which intersects that range, we also trim the start extent to not
have start < off.
Documentation in filesystems/fiemap.txt and the xfs_io man page suggests
that returning the whole extent is expected.
Some cases which all yield the same fiemap in xfs, but not btrfs:
dd if=/dev/zero of=$f bs=4k count=1
sudo xfs_io -c 'fiemap 0 1024' $f
0: [0..7]: 26624..26631
sudo xfs_io -c 'fiemap 2048 1024' $f
0: [4..7]: 26628..26631
sudo xfs_io -c 'fiemap 2048 4096' $f
0: [4..7]: 26628..26631
sudo xfs_io -c 'fiemap 3584 512' $f
0: [7..7]: 26631..26631
sudo xfs_io -c 'fiemap 4091 5' $f
0: [7..6]: 26631..26630
I believe this is a consequence of the logic for merging contiguous
extents represented by separate extent items. That logic needs to track
the last offset as it loops through the extent items, which happens to
pick up the start offset on the first iteration, and trim off the
beginning of the full extent. To fix it, start `off` at 0 rather than
`start` so that we keep the iteration/merging intact without cutting off
the start of the extent.
after the fix, all the above commands give:
0: [0..7]: 26624..26631
The merging logic is exercised by fstest generic/483, and I have written
a new fstest for checking we don't have backwards or zero-length fiemaps
for cases like those above.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Boris Burkov <boris@bur.io>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
|
|
71795ee590 |
btrfs: avoid RCU stalls while running delayed iputs
Generally a delayed iput is added when we might do the final iput, so usually we'll end up sleeping while processing the delayed iputs naturally. However there's no guarantee of this, especially for small files. In production we noticed 5 instances of RCU stalls while testing a kernel release overnight across 1000 machines, so this is relatively common: host count: 5 rcu: INFO: rcu_sched self-detected stall on CPU rcu: ....: (20998 ticks this GP) idle=59e/1/0x4000000000000002 softirq=12333372/12333372 fqs=3208 (t=21031 jiffies g=27810193 q=41075) NMI backtrace for cpu 1 CPU: 1 PID: 1713 Comm: btrfs-cleaner Kdump: loaded Not tainted 5.6.13-0_fbk12_rc1_5520_gec92bffc1ec9 #1 Call Trace: <IRQ> dump_stack+0x50/0x70 nmi_cpu_backtrace.cold.6+0x30/0x65 ? lapic_can_unplug_cpu.cold.30+0x40/0x40 nmi_trigger_cpumask_backtrace+0xba/0xca rcu_dump_cpu_stacks+0x99/0xc7 rcu_sched_clock_irq.cold.90+0x1b2/0x3a3 ? trigger_load_balance+0x5c/0x200 ? tick_sched_do_timer+0x60/0x60 ? tick_sched_do_timer+0x60/0x60 update_process_times+0x24/0x50 tick_sched_timer+0x37/0x70 __hrtimer_run_queues+0xfe/0x270 hrtimer_interrupt+0xf4/0x210 smp_apic_timer_interrupt+0x5e/0x120 apic_timer_interrupt+0xf/0x20 </IRQ> RIP: 0010:queued_spin_lock_slowpath+0x17d/0x1b0 RSP: 0018:ffffc9000da5fe48 EFLAGS: 00000246 ORIG_RAX: ffffffffffffff13 RAX: 0000000000000000 RBX: ffff889fa81d0cd8 RCX: 0000000000000029 RDX: ffff889fff86c0c0 RSI: 0000000000080000 RDI: ffff88bfc2da7200 RBP: ffff888f2dcdd768 R08: 0000000001040000 R09: 0000000000000000 R10: 0000000000000001 R11: ffffffff82a55560 R12: ffff88bfc2da7200 R13: 0000000000000000 R14: ffff88bff6c2a360 R15: ffffffff814bd870 ? kzalloc.constprop.57+0x30/0x30 list_lru_add+0x5a/0x100 inode_lru_list_add+0x20/0x40 iput+0x1c1/0x1f0 run_delayed_iput_locked+0x46/0x90 btrfs_run_delayed_iputs+0x3f/0x60 cleaner_kthread+0xf2/0x120 kthread+0x10b/0x130 Fix this by adding a cond_resched_lock() to the loop processing delayed iputs so we can avoid these sort of stalls. CC: stable@vger.kernel.org # 4.9+ Reviewed-by: Rik van Riel <riel@surriel.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
|
|
|
d6f67afbdf |
btrfs: return 0 for dev_extent_hole_check_zoned hole_start in case of error
Commit |
|
|
|
88b06399c9 |
for-5.13-rc1-part2-tag
-----BEGIN PGP SIGNATURE----- iQIzBAABCgAdFiEE8rQSAMVO+zA4DBdWxWXV+ddtWDsFAmCaiuQACgkQxWXV+ddt WDv3Ww//bDUlNXqAYEoLKePohy1bupiqG8lKYX4s4bGEq0x0cyh4qVER/Q/lU2l2 AMf8t6Pwr/iBOPwfckreLDuFrhacvWq0K4eMkgpf++3P0Mzbj2sIBX0+XnrWluRL yFCZudJej+cpM55Ve4l6M8zrk1nbzYJLFPRRdOIFe4HonWkhI/zY6RD7kFybQevW mAxqMgIpUQAjoj5F/EhwXQ9dk6PXSZj+gaOoNrmQmN7mZMqNgSLHBEoJUHrotm1K rDlEwIRUTtNPV+rcPxcXD1GFiUxU0cZhg0jts252z89Mvaqb2g/YKaHPAR/IVIt5 enf4llZzoEeiMnHuSj9zCg4HxOvCCFV8zZYXlO7/9IqdgLJjQkElZoqTz45obWdE aoJrHAWWlulS2jPocJfJ/Zti2xBYGLjQASH0kYS+vjVxjKyqz3fuM1Tsasaf9Mcp +M2m6yMBjJ0nJMTL2CgBksCd0dHwfiBZ/YYClrMSjYlzYSU6ofA2b2hej0OjqZ4X FmpEmCBK4lySdJI+JlJKikeneOOxKSpT0xGqU+OMmbpwFH3k1N3oseu0hrG8Xreo RU1xNbekGTwRbCcCA9l5HQ/RYptT7rt/KqkC70UFEvdIijCNcptOGaTAoYvLS14O T+yu0Cizt7O0Fdg5E+MAS/qaI2yacXxBfIkMDbPxHGUg7+vUteM= =Phtq -----END PGP SIGNATURE----- Merge tag 'for-5.13-rc1-part2-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux Pull btrfs fix from David Sterba: "Handle transaction start error in btrfs_fileattr_set() This is fix for code introduced by the new fileattr merge" * tag 'for-5.13-rc1-part2-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux: btrfs: handle transaction start error in btrfs_fileattr_set |
|
|
|
9b8a233bc2 |
btrfs: handle transaction start error in btrfs_fileattr_set
Add error handling in btrfs_fileattr_set in case of an error while
starting a transaction. This fixes btrfs/232 which otherwise used to
fail with below signature on Power.
btrfs/232 [ 1119.474650] run fstests btrfs/232 at 2021-04-21 02:21:22
<...>
[ 1366.638585] BUG: Unable to handle kernel data access on read at 0xffffffffffffff86
[ 1366.638768] Faulting instruction address: 0xc0000000009a5c88
cpu 0x0: Vector: 380 (Data SLB Access) at [c000000014f177b0]
pc: c0000000009a5c88: btrfs_update_root_times+0x58/0xc0
lr: c0000000009a5c84: btrfs_update_root_times+0x54/0xc0
<...>
pid = 24881, comm = fsstress
btrfs_update_inode+0xa0/0x140
btrfs_fileattr_set+0x5d0/0x6f0
vfs_fileattr_set+0x2a8/0x390
do_vfs_ioctl+0x1290/0x1ac0
sys_ioctl+0x6c/0x120
system_call_exception+0x3d4/0x410
system_call_common+0xec/0x278
Fixes:
|
|
|
|
142b507f91 |
for-5.13-rc1-tag
-----BEGIN PGP SIGNATURE----- iQIzBAABCgAdFiEE8rQSAMVO+zA4DBdWxWXV+ddtWDsFAmCZnCIACgkQxWXV+ddt WDuEvhAAmC+Mkrz25GbQnSIp2FKYCCQK34D0rdghml0Bc0cJcDh3yhgIB6ZTHZ7e Z+UZu84ISK31OHKDzXtX0MINN2wuU4u4kd6PHtYj0wSVl3cX6E/K5j6YcThfI1Ru vCW5O87V9SCV5NnykIFt3sbYvsPKtF9lhgPQprj4np+wxaSyNlEF2c+zLTI3J7NV +8OlM4oi8GocZd1aAwGpVM3qUPyQSHEb9oUEp6aV1ERuAs6LIyeGks3Cag6gjPnq dYz3jV9HyZB5GtX0dmv4LeRFIog1uFi+SIEFl5RpqhB3sXN3n6XHMka4x20FXiWy PfX9+Nf4bQGx6F9rGsgHNHQP5dVhHAkZcq3E0n0yshIfNe8wDHBRlmk0wbfj4K7I VYv85SxEYpigG8KzF5gjiar4EqsaJVQcJioMxVE7z9vrW6xlOWD1lf/ViUZnB3wd IQEyGz2qOe9eqJD+dnyN7QkN9WKGSUr2p1Q/DngCIwFzKWf1qIlETNXrIL+AZ97r v4G5mMq9dCxs3s8c5SGbdF9qqK8gEuaV3iWQAoKOciuy6fbc553Q90I1v3OhW+by j2yVoo3nJbBJBuLBNWPDUlwxQF/EHPQ6nh3fvxNRgwksXgRmqywdJb5dQ8hcKgSH RsvinJhtKo5rTgtgGgmNvmLAjKIieW1lIVG4ha0O/m49HeaohDE= =GNNs -----END PGP SIGNATURE----- Merge tag 'for-5.13-rc1-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux Pull btrfs fixes from David Sterba: "First batch of various fixes, here's a list of notable ones: - fix unmountable seed device after fstrim - fix silent data loss in zoned mode due to ordered extent splitting - fix race leading to unpersisted data and metadata on fsync - fix deadlock when cloning inline extents and using qgroups" * tag 'for-5.13-rc1-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux: btrfs: initialize return variable in cleanup_free_space_cache_v1 btrfs: zoned: sanity check zone type btrfs: fix unmountable seed device after fstrim btrfs: fix deadlock when cloning inline extents and using qgroups btrfs: fix race leading to unpersisted data and metadata on fsync btrfs: do not consider send context as valid when trying to flush qgroups btrfs: zoned: fix silent data loss after failure splitting ordered extent |
|
|
|
d048b9c2a7 |
btrfs: use memzero_page() instead of open coded kmap pattern
There are many places where kmap/memset/kunmap patterns occur. Use the newly lifted memzero_page() to eliminate direct uses of kmap and leverage the new core functions use of kmap_local_page(). The development of this patch was aided by the following coccinelle script: // <smpl> // SPDX-License-Identifier: GPL-2.0-only // Find kmap/memset/kunmap pattern and replace with memset*page calls // // NOTE: Offsets and other expressions may be more complex than what the script // will automatically generate. Therefore a catchall rule is provided to find // the pattern which then must be evaluated by hand. // // Confidence: Low // Copyright: (C) 2021 Intel Corporation // URL: http://coccinelle.lip6.fr/ // Comments: // Options: // // Then the memset pattern // @ memset_rule1 @ expression page, V, L, Off; identifier ptr; type VP; @@ ( -VP ptr = kmap(page); | -ptr = kmap(page); | -VP ptr = kmap_atomic(page); | -ptr = kmap_atomic(page); ) <+... ( -memset(ptr, 0, L); +memzero_page(page, 0, L); | -memset(ptr + Off, 0, L); +memzero_page(page, Off, L); | -memset(ptr, V, L); +memset_page(page, V, 0, L); | -memset(ptr + Off, V, L); +memset_page(page, V, Off, L); ) ...+> ( -kunmap(page); | -kunmap_atomic(ptr); ) // Remove any pointers left unused @ depends on memset_rule1 @ identifier memset_rule1.ptr; type VP, VP1; @@ -VP ptr; ... when != ptr; ? VP1 ptr; // // Catch all // @ memset_rule2 @ expression page; identifier ptr; expression GenTo, GenSize, GenValue; type VP; @@ ( -VP ptr = kmap(page); | -ptr = kmap(page); | -VP ptr = kmap_atomic(page); | -ptr = kmap_atomic(page); ) <+... ( // // Some call sites have complex expressions within the memset/memcpy // The follow are catch alls which need to be evaluated by hand. // -memset(GenTo, 0, GenSize); +memzero_pageExtra(page, GenTo, GenSize); | -memset(GenTo, GenValue, GenSize); +memset_pageExtra(page, GenValue, GenTo, GenSize); ) ...+> ( -kunmap(page); | -kunmap_atomic(ptr); ) // Remove any pointers left unused @ depends on memset_rule2 @ identifier memset_rule2.ptr; type VP, VP1; @@ -VP ptr; ... when != ptr; ? VP1 ptr; // </smpl> Link: https://lkml.kernel.org/r/20210309212137.2610186-4-ira.weiny@intel.com Signed-off-by: Ira Weiny <ira.weiny@intel.com> Reviewed-by: David Sterba <dsterba@suse.com> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com> Cc: Chris Mason <clm@fb.com> Cc: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
|
|
|
77364faf21 |
btrfs: initialize return variable in cleanup_free_space_cache_v1
Static analysis reports this problem
free-space-cache.c:3965:2: warning: Undefined or garbage value returned
return ret;
^~~~~~~~~~
ret is set in the node handling loop. Treat doing nothing as a success
and initialize ret to 0, although it's unlikely the loop would be
skipped. We always have block groups, but as it could lead to
transaction abort in the caller it's better to be safe.
CC: stable@vger.kernel.org # 5.12+
Signed-off-by: Tom Rix <trix@redhat.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
|
|
784daf2b96 |
btrfs: zoned: sanity check zone type
The fstests test case generic/475 creates a dm-linear device that gets changed to a dm-error device. This leads to errors in loading the block group's zone information when running on a zoned file system, ultimately resulting in a list corruption. When running on a kernel with list debugging enabled this leads to the following crash. BTRFS: error (device dm-2) in cleanup_transaction:1953: errno=-5 IO failure kernel BUG at lib/list_debug.c:54! invalid opcode: 0000 [#1] SMP PTI CPU: 1 PID: 2433 Comm: umount Tainted: G W 5.12.0+ #1018 RIP: 0010:__list_del_entry_valid.cold+0x1d/0x47 RSP: 0018:ffffc90001473df0 EFLAGS: 00010296 RAX: 0000000000000054 RBX: ffff8881038fd000 RCX: ffffc90001473c90 RDX: 0000000100001a31 RSI: 0000000000000003 RDI: 0000000000000003 RBP: ffff888308871108 R08: 0000000000000003 R09: 0000000000000001 R10: 3961373532383838 R11: 6666666620736177 R12: ffff888308871000 R13: ffff8881038fd088 R14: ffff8881038fdc78 R15: dead000000000100 FS: 00007f353c9b1540(0000) GS:ffff888627d00000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 00007f353cc2c710 CR3: 000000018e13c000 CR4: 00000000000006a0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 Call Trace: btrfs_free_block_groups+0xc9/0x310 [btrfs] close_ctree+0x2ee/0x31a [btrfs] ? call_rcu+0x8f/0x270 ? mutex_lock+0x1c/0x40 generic_shutdown_super+0x67/0x100 kill_anon_super+0x14/0x30 btrfs_kill_super+0x12/0x20 [btrfs] deactivate_locked_super+0x31/0x90 cleanup_mnt+0x13e/0x1b0 task_work_run+0x63/0xb0 exit_to_user_mode_loop+0xd9/0xe0 exit_to_user_mode_prepare+0x3e/0x60 syscall_exit_to_user_mode+0x1d/0x50 entry_SYSCALL_64_after_hwframe+0x44/0xae As dm-error has no support for zones, btrfs will run it's zone emulation mode on this device. The zone emulation mode emulates conventional zones, so bail out if the zone bitmap that gets populated on mount sees the zone as sequential while we're thinking it's a conventional zone when creating a block group. Note: this scenario is unlikely in a real wold application and can only happen by this (ab)use of device-mapper targets. CC: stable@vger.kernel.org # 5.12+ Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com> Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: David Sterba <dsterba@suse.com> |
|
|
|
5e753a817b |
btrfs: fix unmountable seed device after fstrim
The following test case reproduces an issue of wrongly freeing in-use
blocks on the readonly seed device when fstrim is called on the rw sprout
device. As shown below.
Create a seed device and add a sprout device to it:
$ mkfs.btrfs -fq -dsingle -msingle /dev/loop0
$ btrfstune -S 1 /dev/loop0
$ mount /dev/loop0 /btrfs
$ btrfs dev add -f /dev/loop1 /btrfs
BTRFS info (device loop0): relocating block group 290455552 flags system
BTRFS info (device loop0): relocating block group 1048576 flags system
BTRFS info (device loop0): disk added /dev/loop1
$ umount /btrfs
Mount the sprout device and run fstrim:
$ mount /dev/loop1 /btrfs
$ fstrim /btrfs
$ umount /btrfs
Now try to mount the seed device, and it fails:
$ mount /dev/loop0 /btrfs
mount: /btrfs: wrong fs type, bad option, bad superblock on /dev/loop0, missing codepage or helper program, or other error.
Block 5292032 is missing on the readonly seed device:
$ dmesg -kt | tail
<snip>
BTRFS error (device loop0): bad tree block start, want 5292032 have 0
BTRFS warning (device loop0): couldn't read-tree root
BTRFS error (device loop0): open_ctree failed
From the dump-tree of the seed device (taken before the fstrim). Block
5292032 belonged to the block group starting at 5242880:
$ btrfs inspect dump-tree -e /dev/loop0 | grep -A1 BLOCK_GROUP
<snip>
item 3 key (5242880 BLOCK_GROUP_ITEM 8388608) itemoff 16169 itemsize 24
block group used 114688 chunk_objectid 256 flags METADATA
<snip>
From the dump-tree of the sprout device (taken before the fstrim).
fstrim used block-group 5242880 to find the related free space to free:
$ btrfs inspect dump-tree -e /dev/loop1 | grep -A1 BLOCK_GROUP
<snip>
item 1 key (5242880 BLOCK_GROUP_ITEM 8388608) itemoff 16226 itemsize 24
block group used 32768 chunk_objectid 256 flags METADATA
<snip>
BPF kernel tracing the fstrim command finds the missing block 5292032
within the range of the discarded blocks as below:
kprobe:btrfs_discard_extent {
printf("freeing start %llu end %llu num_bytes %llu:\n",
arg1, arg1+arg2, arg2);
}
freeing start
|
|
|
|
f9baa501b4 |
btrfs: fix deadlock when cloning inline extents and using qgroups
There are a few exceptional cases where cloning an inline extent needs to copy the inline extent data into a page of the destination inode. When this happens, we end up starting a transaction while having a dirty page for the destination inode and while having the range locked in the destination's inode iotree too. Because when reserving metadata space for a transaction we may need to flush existing delalloc in case there is not enough free space, we have a mechanism in place to prevent a deadlock, which was introduced in commit |
|
|
|
626e9f41f7 |
btrfs: fix race leading to unpersisted data and metadata on fsync
When doing a fast fsync on a file, there is a race which can result in the
fsync returning success to user space without logging the inode and without
durably persisting new data.
The following example shows one possible scenario for this:
$ mkfs.btrfs -f /dev/sdc
$ mount /dev/sdc /mnt
$ touch /mnt/bar
$ xfs_io -f -c "pwrite -S 0xab 0 1M" -c "fsync" /mnt/baz
# Now we have:
# file bar == inode 257
# file baz == inode 258
$ mv /mnt/baz /mnt/foo
# Now we have:
# file bar == inode 257
# file foo == inode 258
$ xfs_io -c "pwrite -S 0xcd 0 1M" /mnt/foo
# fsync bar before foo, it is important to trigger the race.
$ xfs_io -c "fsync" /mnt/bar
$ xfs_io -c "fsync" /mnt/foo
# After this:
# inode 257, file bar, is empty
# inode 258, file foo, has 1M filled with 0xcd
<power failure>
# Replay the log:
$ mount /dev/sdc /mnt
# After this point file foo should have 1M filled with 0xcd and not 0xab
The following steps explain how the race happens:
1) Before the first fsync of inode 258, when it has the "baz" name, its
->logged_trans is 0, ->last_sub_trans is 0 and ->last_log_commit is -1.
The inode also has the full sync flag set;
2) After the first fsync, we set inode 258 ->logged_trans to 6, which is
the generation of the current transaction, and set ->last_log_commit
to 0, which is the current value of ->last_sub_trans (done at
btrfs_log_inode()).
The full sync flag is cleared from the inode during the fsync.
The log sub transaction that was committed had an ID of 0 and when we
synced the log, at btrfs_sync_log(), we incremented root->log_transid
from 0 to 1;
3) During the rename:
We update inode 258, through btrfs_update_inode(), and that causes its
->last_sub_trans to be set to 1 (the current log transaction ID), and
->last_log_commit remains with a value of 0.
After updating inode 258, because we have previously logged the inode
in the previous fsync, we log again the inode through the call to
btrfs_log_new_name(). This results in updating the inode's
->last_log_commit from 0 to 1 (the current value of its
->last_sub_trans).
The ->last_sub_trans of inode 257 is updated to 1, which is the ID of
the next log transaction;
4) Then a buffered write against inode 258 is made. This leaves the value
of ->last_sub_trans as 1 (the ID of the current log transaction, stored
at root->log_transid);
5) Then an fsync against inode 257 (or any other inode other than 258),
happens. This results in committing the log transaction with ID 1,
which results in updating root->last_log_commit to 1 and bumping
root->log_transid from 1 to 2;
6) Then an fsync against inode 258 starts. We flush delalloc and wait only
for writeback to complete, since the full sync flag is not set in the
inode's runtime flags - we do not wait for ordered extents to complete.
Then, at btrfs_sync_file(), we call btrfs_inode_in_log() before the
ordered extent completes. The call returns true:
static inline bool btrfs_inode_in_log(...)
{
bool ret = false;
spin_lock(&inode->lock);
if (inode->logged_trans == generation &&
inode->last_sub_trans <= inode->last_log_commit &&
inode->last_sub_trans <= inode->root->last_log_commit)
ret = true;
spin_unlock(&inode->lock);
return ret;
}
generation has a value of 6 (fs_info->generation), ->logged_trans also
has a value of 6 (set when we logged the inode during the first fsync
and when logging it during the rename), ->last_sub_trans has a value
of 1, set during the rename (step 3), ->last_log_commit also has a
value of 1 (set in step 3) and root->last_log_commit has a value of 1,
which was set in step 5 when fsyncing inode 257.
As a consequence we don't log the inode, any new extents and do not
sync the log, resulting in a data loss if a power failure happens
after the fsync and before the current transaction commits.
Also, because we do not log the inode, after a power failure the mtime
and ctime of the inode do not match those we had before.
When the ordered extent completes before we call btrfs_inode_in_log(),
then the call returns false and we log the inode and sync the log,
since at the end of ordered extent completion we update the inode and
set ->last_sub_trans to 2 (the value of root->log_transid) and
->last_log_commit to 1.
This problem is found after removing the check for the emptiness of the
inode's list of modified extents in the recent commit
|
|
|
|
ffb7c2e923 |
btrfs: do not consider send context as valid when trying to flush qgroups
At qgroup.c:try_flush_qgroup() we are asserting that current->journal_info is either NULL or has the value BTRFS_SEND_TRANS_STUB. However allowing for BTRFS_SEND_TRANS_STUB makes no sense because: 1) It is misleading, because send operations are read-only and do not ever need to reserve qgroup space; 2) We already assert that current->journal_info != BTRFS_SEND_TRANS_STUB at transaction.c:start_transaction(); 3) On a kernel without CONFIG_BTRFS_ASSERT=y set, it would result in a crash if try_flush_qgroup() is ever called in a send context, because at transaction.c:start_transaction we cast current->journal_info into a struct btrfs_trans_handle pointer and then dereference it. So just do allow a send context at try_flush_qgroup() and update the comment about it. Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
|
|
|
adbd914dcd |
btrfs: zoned: fix silent data loss after failure splitting ordered extent
On a zoned filesystem, sometimes we need to split an ordered extent into 3
different ordered extents. The original ordered extent is shortened, at
the front and at the rear, and we create two other new ordered extents to
represent the trimmed parts of the original ordered extent.
After adjusting the original ordered extent, we create an ordered extent
to represent the pre-range, and that may fail with ENOMEM for example.
After that we always try to create the ordered extent for the post-range,
and if that happens to succeed we end up returning success to the caller
as we overwrite the 'ret' variable which contained the previous error.
This means we end up with a file range for which there is no ordered
extent, which results in the range never getting a new file extent item
pointing to the new data location. And since the split operation did
not return an error, writeback does not fail and the inode's mapping is
not flagged with an error, resulting in a subsequent fsync not reporting
an error either.
It's possibly very unlikely to have the creation of the post-range ordered
extent succeed after the creation of the pre-range ordered extent failed,
but it's not impossible.
So fix this by making sure we only create the post-range ordered extent
if there was no error creating the ordered extent for the pre-range.
Fixes:
|
|
|
|
a4f7fae101 |
Merge branch 'miklos.fileattr' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull fileattr conversion updates from Miklos Szeredi via Al Viro: "This splits the handling of FS_IOC_[GS]ETFLAGS from ->ioctl() into a separate method. The interface is reasonably uniform across the filesystems that support it and gives nice boilerplate removal" * 'miklos.fileattr' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (23 commits) ovl: remove unneeded ioctls fuse: convert to fileattr fuse: add internal open/release helpers fuse: unsigned open flags fuse: move ioctl to separate source file vfs: remove unused ioctl helpers ubifs: convert to fileattr reiserfs: convert to fileattr ocfs2: convert to fileattr nilfs2: convert to fileattr jfs: convert to fileattr hfsplus: convert to fileattr efivars: convert to fileattr xfs: convert to fileattr orangefs: convert to fileattr gfs2: convert to fileattr f2fs: convert to fileattr ext4: convert to fileattr ext2: convert to fileattr btrfs: convert to fileattr ... |
|
|
|
57fa2369ab |
CFI on arm64 series for v5.13-rc1
- Clean up list_sort prototypes (Sami Tolvanen) - Introduce CONFIG_CFI_CLANG for arm64 (Sami Tolvanen) -----BEGIN PGP SIGNATURE----- iQIzBAABCgAdFiEEpcP2jyKd1g9yPm4TiXL039xtwCYFAmCHCR8ACgkQiXL039xt wCZyFQ//fnUZaXR2K354zDyW6CJljMf+d94RF6rH+J6eMTH2/HXa5v0iJokwABLf ussP6qF4k5wtmI22Gm9A5Zc3e4iiry5pC0jOdk0mk4gzWwFN9MdgNxJZIGA3xqhS bsBK4AGrVKjtZl48G1/ZxJuNDeJhVp6GNK2n6/Gl4rZF6R7D/Upz0XelyJRdDpcM HIGma7jZl6xfGU0mdWCzpOGK1zdMca1WVs7A4YuurSbLn5PZJrcNVWLouDqt/Si2 AduSri1gyPClicgvqWjMOzhUpuw/nJtBLRl1x1EsWk/KSZ1/uNVjlewfzdN4fZrr zbtFr2gLubYLK6JOX7/LqoHlOTgE3tYLL+WIVN75DsOGZBKgHhmebTmWLyqzV0SL oqcyM5d3ucC6msdtAK5Fv4MSp8rpjqlK1Ha4SGRT6kC2wut7AhZ3KD7eyRIz8mV9 Sa9mhignGFJnTEUp+LSbYdrAudgSKxB40WyXPmswAXX4VJFRD4ONrrcAON/SzkUT Hw/JdFRCKkJjgwNQjIQoZcUNMTbFz2PlNIEnjJWm38YImQKQlCb2mXaZKCwBkf45 aheCZk17eKoxTCXFMd+KxlyNEtS2yBfq/PpZgvw7GW/pfFbWUg1+2O41LnihIe5v zu0hN1wNCQqgfxiMZqX1OTb9C/2vybzGsXILt+9nppjZ8EBU7iU= =wU6U -----END PGP SIGNATURE----- Merge tag 'cfi-v5.13-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux Pull CFI on arm64 support from Kees Cook: "This builds on last cycle's LTO work, and allows the arm64 kernels to be built with Clang's Control Flow Integrity feature. This feature has happily lived in Android kernels for almost 3 years[1], so I'm excited to have it ready for upstream. The wide diffstat is mainly due to the treewide fixing of mismatched list_sort prototypes. Other things in core kernel are to address various CFI corner cases. The largest code portion is the CFI runtime implementation itself (which will be shared by all architectures implementing support for CFI). The arm64 pieces are Acked by arm64 maintainers rather than coming through the arm64 tree since carrying this tree over there was going to be awkward. CFI support for x86 is still under development, but is pretty close. There are a handful of corner cases on x86 that need some improvements to Clang and objtool, but otherwise works well. Summary: - Clean up list_sort prototypes (Sami Tolvanen) - Introduce CONFIG_CFI_CLANG for arm64 (Sami Tolvanen)" * tag 'cfi-v5.13-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux: arm64: allow CONFIG_CFI_CLANG to be selected KVM: arm64: Disable CFI for nVHE arm64: ftrace: use function_nocfi for ftrace_call arm64: add __nocfi to __apply_alternatives arm64: add __nocfi to functions that jump to a physical address arm64: use function_nocfi with __pa_symbol arm64: implement function_nocfi psci: use function_nocfi for cpu_resume lkdtm: use function_nocfi treewide: Change list_sort to use const pointers bpf: disable CFI in dispatcher functions kallsyms: strip ThinLTO hashes from static functions kthread: use WARN_ON_FUNCTION_MISMATCH workqueue: use WARN_ON_FUNCTION_MISMATCH module: ensure __cfi_check alignment mm: add generic function_nocfi macro cfi: add __cficanonical add support for Clang CFI |
|
|
|
18bb8bbf13 |
btrfs: zoned: automatically reclaim zones
When a file gets deleted on a zoned file system, the space freed is not returned back into the block group's free space, but is migrated to zone_unusable. As this zone_unusable space is behind the current write pointer it is not possible to use it for new allocations. In the current implementation a zone is reset once all of the block group's space is accounted as zone unusable. This behaviour can lead to premature ENOSPC errors on a busy file system. Instead of only reclaiming the zone once it is completely unusable, kick off a reclaim job once the amount of unusable bytes exceeds a user configurable threshold between 51% and 100%. It can be set per mounted filesystem via the sysfs tunable bg_reclaim_threshold which is set to 75% by default. Similar to reclaiming unused block groups, these dirty block groups are added to a to_reclaim list and then on a transaction commit, the reclaim process is triggered but after we deleted unused block groups, which will free space for the relocation process. Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
|
|
|
f33720657d |
btrfs: rename delete_unused_bgs_mutex to reclaim_bgs_lock
As a preparation for extending the block group deletion use case, rename the unused_bgs_mutex to reclaim_bgs_lock. Reviewed-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
|
|
|
01e86008aa |
btrfs: zoned: reset zones of relocated block groups
When relocating a block group the freed up space is not discarded in one big block, but each extent is discarded on its own with -odisard=sync. For a zoned filesystem we need to discard the whole block group at once, so btrfs_discard_extent() will translate the discard into a REQ_OP_ZONE_RESET operation, which then resets the device's zone. Failure to reset the zone is not fatal error. Discussion about the approach and regarding transaction blocking: https://lore.kernel.org/linux-btrfs/CAL3q7H4SjS_d5rBepfTMhU8Th3bJzdmyYd0g4Z60yUgC_rC_ZA@mail.gmail.com/ Reviewed-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: David Sterba <dsterba@suse.com> |
|
|
|
e9306ad4ef |
btrfs: more graceful errors/warnings on 32bit systems when reaching limits
Btrfs uses internally mapped u64 address space for all its metadata. Due to the page cache limit on 32bit systems, btrfs can't access metadata at or beyond (ULONG_MAX + 1) << PAGE_SHIFT. See how MAX_LFS_FILESIZE and page::index are defined. This is 16T for 4K page size while 256T for 64K page size. Users can have a filesystem which doesn't have metadata beyond the boundary at mount time, but later balance can cause it to create metadata beyond the boundary. And modification to MM layer is unrealistic just for such minor use case. We can't do more than to prevent mounting such filesystem or warn early when the numbers are still within the limits. To address such problem, this patch will introduce the following checks: - Mount time rejection This will reject any fs which has metadata chunk at or beyond the boundary. - Mount time early warning If there is any metadata chunk beyond 5/8th of the boundary, we do an early warning and hope the end user will see it. - Runtime extent buffer rejection If we're going to allocate an extent buffer at or beyond the boundary, reject such request with EOVERFLOW. This is definitely going to cause problems like transaction abort, but we have no better ways. - Runtime extent buffer early warning If an extent buffer beyond 5/8th of the max file size is allocated, do an early warning. Above error/warning message will only be printed once for each fs to reduce dmesg flood. If the mount is rejected, the filesystem will be mountable only on a 64bit host. Link: https://lore.kernel.org/linux-btrfs/1783f16d-7a28-80e6-4c32-fdf19b705ed0@gmx.com/ Reported-by: Erik Jensen <erikjensen@rkjnsn.net> Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
|
|
|
0dc16ef4f6 |
btrfs: zoned: fix unpaired block group unfreeze during device replace
When doing a device replace on a zoned filesystem, if we find a block
group with ->to_copy == 0, we jump to the label 'done', which will result
in later calling btrfs_unfreeze_block_group(), even though at this point
we never called btrfs_freeze_block_group().
Since at this point we have neither turned the block group to RO mode nor
made any progress, we don't need to jump to the label 'done'. So fix this
by jumping instead to the label 'skip' and dropping our reference on the
block group before the jump.
Fixes:
|
|
|
|
f9690f426b |
btrfs: fix race when picking most recent mod log operation for an old root
Commit |
|
|
|
67addf2900 |
btrfs: fix metadata extent leak after failure to create subvolume
When creating a subvolume we allocate an extent buffer for its root node after starting a transaction. We setup a root item for the subvolume that points to that extent buffer and then attempt to insert the root item into the root tree - however if that fails, due to ENOMEM for example, we do not free the extent buffer previously allocated and we do not abort the transaction (as at that point we did nothing that can not be undone). This means that we effectively do not return the metadata extent back to the free space cache/tree and we leave a delayed reference for it which causes a metadata extent item to be added to the extent tree, in the next transaction commit, without having backreferences. When this happens 'btrfs check' reports the following: $ btrfs check /dev/sdi Opening filesystem to check... Checking filesystem on /dev/sdi UUID: dce2cb9d-025f-4b05-a4bf-cee0ad3785eb [1/7] checking root items [2/7] checking extents ref mismatch on [30425088 16384] extent item 1, found 0 backref 30425088 root 256 not referenced back 0x564a91c23d70 incorrect global backref count on 30425088 found 1 wanted 0 backpointer mismatch on [30425088 16384] owner ref check failed [30425088 16384] ERROR: errors found in extent allocation tree or chunk allocation [3/7] checking free space cache [4/7] checking fs roots [5/7] checking only csums items (without verifying data) [6/7] checking root refs [7/7] checking quota groups skipped (not enabled on this FS) found 212992 bytes used, error(s) found total csum bytes: 0 total tree bytes: 131072 total fs tree bytes: 32768 total extent tree bytes: 16384 btree space waste bytes: 124669 file data blocks allocated: 65536 referenced 65536 So fix this by freeing the metadata extent if btrfs_insert_root() returns an error. CC: stable@vger.kernel.org # 4.4+ Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
|
|
|
1d8ba9e7e7 |
btrfs: handle remount to no compress during compression
[BUG]
When running btrfs/071 with inode_need_compress() removed from
compress_file_range(), we got the following crash:
BUG: kernel NULL pointer dereference, address: 0000000000000018
#PF: supervisor read access in kernel mode
#PF: error_code(0x0000) - not-present page
Workqueue: btrfs-delalloc btrfs_work_helper [btrfs]
RIP: 0010:compress_file_range+0x476/0x7b0 [btrfs]
Call Trace:
? submit_compressed_extents+0x450/0x450 [btrfs]
async_cow_start+0x16/0x40 [btrfs]
btrfs_work_helper+0xf2/0x3e0 [btrfs]
process_one_work+0x278/0x5e0
worker_thread+0x55/0x400
? process_one_work+0x5e0/0x5e0
kthread+0x168/0x190
? kthread_create_worker_on_cpu+0x70/0x70
ret_from_fork+0x22/0x30
---[ end trace 65faf4eae941fa7d ]---
This is already after the patch "btrfs: inode: fix NULL pointer
dereference if inode doesn't need compression."
[CAUSE]
@pages is firstly created by kcalloc() in compress_file_extent():
pages = kcalloc(nr_pages, sizeof(struct page *), GFP_NOFS);
Then passed to btrfs_compress_pages() to be utilized there:
ret = btrfs_compress_pages(...
pages,
&nr_pages,
...);
btrfs_compress_pages() will initialize each page as output, in
zlib_compress_pages() we have:
pages[nr_pages] = out_page;
nr_pages++;
Normally this is completely fine, but there is a special case which
is in btrfs_compress_pages() itself:
switch (type) {
default:
return -E2BIG;
}
In this case, we didn't modify @pages nor @out_pages, leaving them
untouched, then when we cleanup pages, the we can hit NULL pointer
dereference again:
if (pages) {
for (i = 0; i < nr_pages; i++) {
WARN_ON(pages[i]->mapping);
put_page(pages[i]);
}
...
}
Since pages[i] are all initialized to zero, and btrfs_compress_pages()
doesn't change them at all, accessing pages[i]->mapping would lead to
NULL pointer dereference.
This is not possible for current kernel, as we check
inode_need_compress() before doing pages allocation.
But if we're going to remove that inode_need_compress() in
compress_file_extent(), then it's going to be a problem.
[FIX]
When btrfs_compress_pages() hits its default case, modify @out_pages to
0 to prevent such problem from happening.
Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=212331
CC: stable@vger.kernel.org # 5.10+
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
|
|
1d68128c10 |
btrfs: zoned: fail mount if the device does not support zone append
For zoned btrfs, zone append is mandatory to write to a sequential write only zone, otherwise parallel writes to the same zone could result in unaligned write errors. If a zoned block device does not support zone append (e.g. a dm-crypt zoned device using a non-NULL IV cypher), fail to mount. CC: stable@vger.kernel.org # 5.12 Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
|
|
|
061dde8245 |
btrfs: fix race between transaction aborts and fsyncs leading to use-after-free
There is a race between a task aborting a transaction during a commit,
a task doing an fsync and the transaction kthread, which leads to an
use-after-free of the log root tree. When this happens, it results in a
stack trace like the following:
BTRFS info (device dm-0): forced readonly
BTRFS warning (device dm-0): Skipping commit of aborted transaction.
BTRFS: error (device dm-0) in cleanup_transaction:1958: errno=-5 IO failure
BTRFS warning (device dm-0): lost page write due to IO error on /dev/mapper/error-test (-5)
BTRFS warning (device dm-0): Skipping commit of aborted transaction.
BTRFS warning (device dm-0): direct IO failed ino 261 rw 0,0 sector 0xa4e8 len 4096 err no 10
BTRFS error (device dm-0): error writing primary super block to device 1
BTRFS warning (device dm-0): direct IO failed ino 261 rw 0,0 sector 0x12e000 len 4096 err no 10
BTRFS warning (device dm-0): direct IO failed ino 261 rw 0,0 sector 0x12e008 len 4096 err no 10
BTRFS warning (device dm-0): direct IO failed ino 261 rw 0,0 sector 0x12e010 len 4096 err no 10
BTRFS: error (device dm-0) in write_all_supers:4110: errno=-5 IO failure (1 errors while writing supers)
BTRFS: error (device dm-0) in btrfs_sync_log:3308: errno=-5 IO failure
general protection fault, probably for non-canonical address 0x6b6b6b6b6b6b6b68: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC PTI
CPU: 2 PID: 2458471 Comm: fsstress Not tainted 5.12.0-rc5-btrfs-next-84 #1
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.14.0-0-g155821a1990b-prebuilt.qemu.org 04/01/2014
RIP: 0010:__mutex_lock+0x139/0xa40
Code: c0 74 19 (...)
RSP: 0018:ffff9f18830d7b00 EFLAGS: 00010202
RAX: 6b6b6b6b6b6b6b68 RBX: 0000000000000001 RCX: 0000000000000002
RDX: ffffffffb9c54d13 RSI: 0000000000000000 RDI: 0000000000000000
RBP: ffff9f18830d7bc0 R08: 0000000000000000 R09: 0000000000000000
R10: ffff9f18830d7be0 R11: 0000000000000001 R12: ffff8c6cd199c040
R13: ffff8c6c95821358 R14: 00000000fffffffb R15: ffff8c6cbcf01358
FS: 00007fa9140c2b80(0000) GS:ffff8c6fac600000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007fa913d52000 CR3: 000000013d2b4003 CR4: 0000000000370ee0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Call Trace:
? __btrfs_handle_fs_error+0xde/0x146 [btrfs]
? btrfs_sync_log+0x7c1/0xf20 [btrfs]
? btrfs_sync_log+0x7c1/0xf20 [btrfs]
btrfs_sync_log+0x7c1/0xf20 [btrfs]
btrfs_sync_file+0x40c/0x580 [btrfs]
do_fsync+0x38/0x70
__x64_sys_fsync+0x10/0x20
do_syscall_64+0x33/0x80
entry_SYSCALL_64_after_hwframe+0x44/0xae
RIP: 0033:0x7fa9142a55c3
Code: 8b 15 09 (...)
RSP: 002b:00007fff26278d48 EFLAGS: 00000246 ORIG_RAX: 000000000000004a
RAX: ffffffffffffffda RBX: 0000563c83cb4560 RCX: 00007fa9142a55c3
RDX: 00007fff26278cb0 RSI: 00007fff26278cb0 RDI: 0000000000000005
RBP: 0000000000000005 R08: 0000000000000001 R09: 00007fff26278d5c
R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000340
R13: 00007fff26278de0 R14: 00007fff26278d96 R15: 0000563c83ca57c0
Modules linked in: btrfs dm_zero dm_snapshot dm_thin_pool (...)
---[ end trace ee2f1b19327d791d ]---
The steps that lead to this crash are the following:
1) We are at transaction N;
2) We have two tasks with a transaction handle attached to transaction N.
Task A and Task B. Task B is doing an fsync;
3) Task B is at btrfs_sync_log(), and has saved fs_info->log_root_tree
into a local variable named 'log_root_tree' at the top of
btrfs_sync_log(). Task B is about to call write_all_supers(), but
before that...
4) Task A calls btrfs_commit_transaction(), and after it sets the
transaction state to TRANS_STATE_COMMIT_START, an error happens before
it waits for the transaction's 'num_writers' counter to reach a value
of 1 (no one else attached to the transaction), so it jumps to the
label "cleanup_transaction";
5) Task A then calls cleanup_transaction(), where it aborts the
transaction, setting BTRFS_FS_STATE_TRANS_ABORTED on fs_info->fs_state,
setting the ->aborted field of the transaction and the handle to an
errno value and also setting BTRFS_FS_STATE_ERROR on fs_info->fs_state.
After that, at cleanup_transaction(), it deletes the transaction from
the list of transactions (fs_info->trans_list), sets the transaction
to the state TRANS_STATE_COMMIT_DOING and then waits for the number
of writers to go down to 1, as it's currently 2 (1 for task A and 1
for task B);
6) The transaction kthread is running and sees that BTRFS_FS_STATE_ERROR
is set in fs_info->fs_state, so it calls btrfs_cleanup_transaction().
There it sees the list fs_info->trans_list is empty, and then proceeds
into calling btrfs_drop_all_logs(), which frees the log root tree with
a call to btrfs_free_log_root_tree();
7) Task B calls write_all_supers() and, shortly after, under the label
'out_wake_log_root', it deferences the pointer stored in
'log_root_tree', which was already freed in the previous step by the
transaction kthread. This results in a use-after-free leading to a
crash.
Fix this by deleting the transaction from the list of transactions at
cleanup_transaction() only after setting the transaction state to
TRANS_STATE_COMMIT_DOING and waiting for all existing tasks that are
attached to the transaction to release their transaction handles.
This makes the transaction kthread wait for all the tasks attached to
the transaction to be done with the transaction before dropping the
log roots and doing other cleanups.
Fixes:
|
|
|
|
c4aec299fa |
btrfs: introduce submit_eb_subpage() to submit a subpage metadata page
The new function, submit_eb_subpage(), will submit all the dirty extent buffers in the page. The major difference between submit_eb_page() and submit_eb_subpage() is: - How to grab extent buffer Now we use find_extent_buffer_nospinlock() other than using page::private. All other different handling is already done in functions like lock_extent_buffer_for_io() and write_one_eb(). Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
|
|
|
f3156df944 |
btrfs: make lock_extent_buffer_for_io() to be subpage compatible
For subpage metadata, we don't use page locking at all. So just skip the page locking part for subpage. The rest of the function can be reused. Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
|
|
|
35b6ddfa96 |
btrfs: introduce write_one_subpage_eb() function
The new function, write_one_subpage_eb(), as a subroutine for subpage metadata write, will handle the extent buffer bio submission. The major differences between the new write_one_subpage_eb() and write_one_eb() is: - No page locking When entering write_one_subpage_eb() the page is no longer locked. We only lock the page for its status update, and unlock immediately. Now we completely rely on extent io tree locking. - Extra bitmap update along with page status update Now page dirty and writeback is controlled by btrfs_subpage::dirty_bitmap and btrfs_subpage::writeback_bitmap. They both follow the schema that any sector is dirty/writeback, then the full page gets dirty/writeback. - When to update the nr_written number Now we take a shortcut, if we have cleared the last dirty bit of the page, we update nr_written. This is not completely perfect, but should emulate the old behavior well enough. Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
|
|
|
2f3186d8ee |
btrfs: introduce end_bio_subpage_eb_writepage() function
The new function, end_bio_subpage_eb_writepage(), will handle the metadata writeback endio. The major differences involved are: - How to grab extent buffer Now page::private is a pointer to btrfs_subpage, we can no longer grab extent buffer directly. Thus we need to use the bv_offset to locate the extent buffer manually and iterate through the whole range. - Use btrfs_subpage_end_writeback() caller This helper will handle the subpage writeback for us. Since this function is executed under endio context, when grabbing extent buffers it can't grab eb->refs_lock as that lock is not designed to be grabbed under hardirq context. So here introduce a helper, find_extent_buffer_nolock(), for such situation, and convert find_extent_buffer() to use that helper. Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
|
|
|
fb686c6824 |
btrfs: check return value of btrfs_commit_transaction in relocation
There are a few places where we don't check the return value of btrfs_commit_transaction in relocation.c. Thankfully all these places have straightforward error handling, so simply change all of the sites at once. Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
|
|
|
24213fa46c |
btrfs: do proper error handling in merge_reloc_roots
We have a BUG_ON() if we get an error back from btrfs_get_fs_root(). This honestly should never fail, as at this point we have a solid coordination of fs root to reloc root, and these roots will all be in memory. But in the name of killing BUG_ON()'s remove these and handle the error condition properly, ASSERT()'ing for developers. Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
|
|
|
8717cf440d |
btrfs: handle extent corruption with select_one_root properly
In corruption cases we could have paths from a block up to no root at all, and thus we'll BUG_ON(!root) in select_one_root. Handle this by adding an ASSERT() for developers, and returning an error for normal users. Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
|
|
|
e0b085b0b0 |
btrfs: cleanup error handling in prepare_to_merge
This probably can't happen even with a corrupt file system, because we would have failed much earlier on than here. However there's no reason we can't just check and bail out as appropriate, so do that and convert the correctness BUG_ON() to an ASSERT(). Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> [ add comment ] Signed-off-by: David Sterba <dsterba@suse.com> |
|
|
|
57a304cfd4 |
btrfs: do not panic in __add_reloc_root
If we have a duplicate entry for a reloc root then we could have fs corruption that resulted in a double allocation. Since this shouldn't happen unless there is corruption, add an ASSERT(ret != -EEXIST) to all of the callers of __add_reloc_root() to catch any logic mistakes for developers, otherwise normal error handling will happen for normal users. Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
|
|
|
3c9258632c |
btrfs: handle __add_reloc_root failures in btrfs_recover_relocation
We can already handle errors appropriately from this function, deal with an error coming from __add_reloc_root appropriately. Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> [ add comment ] Signed-off-by: David Sterba <dsterba@suse.com> |
|
|
|
790c1b8cd4 |
btrfs: do proper error handling in create_reloc_inode
We already handle some errors in this function, and the callers do the correct error handling, so clean up the rest of the function to do the appropriate error handling. There's a little extra work that needs to be done here, as we create the inode item before we create the orphan item. We could potentially add the orphan item, but if we failed to create the inode item we would have to abort the transaction. Instead add a helper to delete the inode item we created in the case that we're unable to look up the inode (this would likely be caused by an ENOMEM), which if it succeeds means we can avoid a transaction abort in this particular error case. Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
|
|
|
24cd638902 |
btrfs: remove the extent item sanity checks in relocate_block_group
These checks are all taken care of for us by the tree checker code: - the flags don't change or are updated consistently - the v0 extent item format is invalid and caught in many other places too Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> [ update changelog ] Signed-off-by: David Sterba <dsterba@suse.com> |
|
|
|
0ebb6bbbd4 |
btrfs: tree-checker: check for BTRFS_BLOCK_FLAG_FULL_BACKREF being set improperly
We need to validate that a data extent item does not have the FULL_BACKREF flag set on its flags. Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
|
|
|
eb6b7fb4b5 |
btrfs: handle extent reference errors in do_relocation
We can already deal with errors appropriately from do_relocation, simply handle any errors that come from changing the refs at this point cleanly. We have to abort the transaction if we fail here as we've modified metadata at this point. Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
|
|
|
253e258c34 |
btrfs: handle errors in reference count manipulation in replace_path
If any of the reference count manipulation stuff fails in replace_path we need to abort the transaction, as we've modified the blocks already. We can simply break at this point and everything will be cleaned up. Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
|
|
|
0e9873e2fe |
btrfs: handle btrfs_search_slot failure in replace_path
The search can fail for various reasons, in case of errors there's no cleanup to be done so we can pass the error to the caller, adjusting for the case where the key is not found and search slot returns 1. Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> [ update changelog ] Signed-off-by: David Sterba <dsterba@suse.com> |
|
|
|
45b87c5d25 |
btrfs: handle btrfs_cow_block errors in replace_path
If we error out COWing the root node when doing a replace_path then we simply unlock and free the buffer and return the error. Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
|
|
|
7a9213a935 |
btrfs: convert logic BUG_ON()'s in replace_path to ASSERT()'s
A few BUG_ON()'s in replace_path are purely to keep us from making logical mistakes, so replace them with ASSERT()'s. Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
|
|
|
592fbcd50c |
btrfs: do proper error handling in btrfs_update_reloc_root
We call btrfs_update_root in btrfs_update_reloc_root, which can fail for all sorts of reasons, including IO errors. Instead of panicing the box lets return the error, now that all callers properly handle those errors. Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
|
|
|
bbae13f8ab |
btrfs: handle btrfs_update_reloc_root failure in prepare_to_merge
btrfs_update_reloc_root will will return errors in the future, so handle an error properly in prepare_to_merge. Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
|
|
|
7934133fae |
btrfs: handle btrfs_update_reloc_root failure in insert_dirty_subvol
btrfs_update_reloc_root will will return errors in the future, so handle the error properly in insert_dirty_subvol. Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
|
|
|
ac54da6c37 |
btrfs: change insert_dirty_subvol to return errors
This will be able to return errors in the future, so change it to return an error and handle the errors appropriately. Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
|
|
|
2dd8298eb3 |
btrfs: handle btrfs_update_reloc_root failure in commit_fs_roots
btrfs_update_reloc_root will will return errors in the future, so handle the error properly in commit_fs_roots. Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
|
|
|
39200e5908 |
btrfs: validate root::reloc_root after recording root in trans
If we fail to setup a root->reloc_root in a different thread that path will error out, however it still leaves root->reloc_root NULL but would still appear set up in the transaction. Subsequent calls to btrfs_record_root_in_transaction would succeed without attempting to create the reloc root, as the transid has already been updated. Handle this case by making sure we have a root->reloc_root set after a btrfs_record_root_in_transaction call so we don't end up dereferencing a NULL pointer. Reported-by: Zygo Blaxell <ce3g8jdj@umail.furryterror.org> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
|
|
|
84c50ba521 |
btrfs: do proper error handling in create_reloc_root
We do memory allocations here, read blocks from disk, all sorts of operations that could easily fail at any given point. Instead of panicing the box, simply return the error back up the chain, all callers at this point have proper error handling. Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
|
|
|
00bb36a0e7 |
btrfs: have proper error handling in btrfs_init_reloc_root
create_reloc_root will return errors in the future, and __add_reloc_root can return ENOMEM or EEXIST, so handle these errors properly. Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> [ add comment ] Signed-off-by: David Sterba <dsterba@suse.com> |
|
|
|
03a7e111a9 |
btrfs: return an error from btrfs_record_root_in_trans
We can create a reloc root when we record the root in the trans, which can fail for all sorts of different reasons. Propagate this error up the chain of callers. Future patches will fix the callers of btrfs_record_root_in_trans() to handle the error. Reviewed-by: Qu Wenruo <wqu@suse.com> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: David Sterba <dsterba@suse.com> |
|
|
|
f0118cb6bc |
btrfs: handle record_root_in_trans failure in create_pending_snapshot
record_root_in_trans can currently fail, so handle this failure properly. Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
|
|
|
1409e6cc74 |
btrfs: handle record_root_in_trans failure in btrfs_record_root_in_trans
record_root_in_trans can fail currently, handle this failure properly. Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
|
|
|
1c442d2246 |
btrfs: handle record_root_in_trans failure in qgroup_account_snapshot
record_root_in_trans can fail currently, so handle this failure properly. Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
|
|
|
68075ea8d7 |
btrfs: handle btrfs_record_root_in_trans failure in start_transaction
btrfs_record_root_in_trans will return errors in the future, so handle the error properly in start_transaction. Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> [ add comment ] Signed-off-by: David Sterba <dsterba@suse.com> |
|
|
|
d18c7bd95c |
btrfs: handle btrfs_record_root_in_trans failure in relocate_tree_block
btrfs_record_root_in_trans will return errors in the future, so handle the error properly in relocate_tree_block. Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
|
|
|
221581e485 |
btrfs: handle btrfs_record_root_in_trans failure in create_subvol
btrfs_record_root_in_trans will return errors in the future, so handle the error properly in create_subvol. Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
|
|
|
2002ae112a |
btrfs: handle btrfs_record_root_in_trans failure in btrfs_recover_log_trees
btrfs_record_root_in_trans will return errors in the future, so handle the error properly in btrfs_recover_log_trees. This appears tricky, however we have a reference count on the destination root, so if this fails we need to continue on in the loop to make sure the proper cleanup is done. Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> [ add comment ] Signed-off-by: David Sterba <dsterba@suse.com> |
|
|
|
2731f5186b |
btrfs: handle btrfs_record_root_in_trans failure in btrfs_delete_subvolume
btrfs_record_root_in_trans will return errors in the future, so handle the error properly in btrfs_delete_subvolume. Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
|
|
|
b0fec6fd33 |
btrfs: handle btrfs_record_root_in_trans failure in btrfs_rename
btrfs_record_root_in_trans will return errors in the future, so handle the error properly in btrfs_rename. Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
|
|
|
00aa8e87c9 |
btrfs: handle btrfs_record_root_in_trans failure in btrfs_rename_exchange
btrfs_record_root_in_trans will return errors in the future, so handle the error properly in btrfs_rename_exchange. Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
|
|
|
404bccbcaa |
btrfs: do proper error handling in record_reloc_root_in_trans
Generally speaking this shouldn't ever fail, the corresponding fs root for the reloc root will already be in memory, so we won't get ENOMEM here. However if there is no corresponding root for the reloc root then we could get ENOMEM when we try to allocate it or we could get ENOENT when we look it up and see that it doesn't exist. Convert these BUG_ON()'s into ASSERT()'s and add proper error handling for the case of corruption. Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
|
|
|
92de551b83 |
btrfs: check record_root_in_trans related failures in select_reloc_root
We will record the fs root or the reloc root in the trans in select_reloc_root. These will actually return errors in the following patches, so check their return value here and return it up the stack. Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
|
|
|
8ee66afe99 |
btrfs: convert BUG_ON()'s in select_reloc_root() to proper errors
We have several BUG_ON()'s in select_reloc_root() that can be tripped if there is an extent tree corruption. Convert these to ASSERT()'s, because if we hit it during testing it really is bad, or could indicate a problem with the backref walking code. However if users hit these problems it generally indicates corruption, I've hit a few machines in the fleet that trip over these with clearly corrupted extent trees, so be nice and print out an error message and return an error instead of bringing the whole box down. Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
|
|
|
cbdc2ebc7c |
btrfs: handle errors from select_reloc_root()
Currently select_reloc_root() doesn't return an error, but followup patches will make it possible for it to return an error. We do have proper error recovery in do_relocation however, so handle the possibility of select_reloc_root() having an error properly instead of BUG_ON(!root). I've also adjusted select_reloc_root() to return ERR_PTR(-ENOENT) if we don't find a root, instead of NULL, to make the error case easier to deal with. I've replaced the BUG_ON(!root) with an ASSERT(0) for this case as it indicates we messed up the backref walking code, but it could also indicate corruption. Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
|
|
|
1c7bfa159f |
btrfs: convert BUG_ON()'s in relocate_tree_block
We have a couple of BUG_ON()'s in relocate_tree_block() that can be tripped if we have file system corruption. Convert these to ASSERT()'s so developers still get yelled at when they break the backref code, but error out nicely for users so the whole box doesn't go down. Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
|
|
|
ffe30dd892 |
btrfs: convert some BUG_ON()'s to ASSERT()'s in do_relocation
A few of these are checking for correctness, and won't be triggered by corrupted file systems, so convert them to ASSERT() instead of BUG_ON() and add a comment explaining their existence. Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
|
|
|
32c0a6bcaa |
btrfs: add and use readahead_batch_length
Implement readahead_batch_length() to determine the number of bytes in the current batch of readahead pages and use it in btrfs. Also use the readahead_pos to get the offset. Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
|
|
|
183ebab766 |
btrfs: move forward declarations to the beginning of extent_io.h
There are two forward declarations deep in extent_io.h, move them to the beginning and remove the duplicate one. Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: Wan Jiabing <wanjiabing@vivo.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
|
|
|
894d137818 |
btrfs: subpage: add overview comments
This patch adds an overview how btrfs subpage support works: - limitations - behavior - basic implementation points Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
|
|
|
5a2c60752a |
btrfs: make set_btree_ioerr accept extent buffer and be subpage compatible
Current set_btree_ioerr() only accepts @page parameter and grabs extent buffer from page::private. This works fine for sector size == PAGE_SIZE case, but not for subpage case. Add an extra parameter, @eb, for callers to pass extent buffer to this function, so that subpage code can reuse this function. And also add subpage special handling to update btrfs_subpage::error_bitmap. Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
|
|
|
0d27797e92 |
btrfs: make set/clear_extent_buffer_dirty() subpage compatible
For set_extent_buffer_dirty() to support subpage sized metadata, just
call btrfs_page_set_dirty() to handle both cases.
For clear_extent_buffer_dirty(), it needs to clear the page dirty if and
only if all extent buffers in the page range are no longer dirty.
Also do the same for page error.
This is pretty different from the existing clear_extent_buffer_dirty()
routine, so add a new helper function,
clear_subpage_extent_buffer_dirty() to do this for subpage metadata.
Also since the main part of clearing page dirty code is still the same,
extract that into btree_clear_page_dirty() so that it can be utilized
for both cases.
But there is a special race between set_extent_buffer_dirty() and
clear_extent_buffer_dirty(), where we can clear the page dirty.
[POSSIBLE RACE WINDOW]
For the race window between clear_subpage_extent_buffer_dirty() and
set_extent_buffer_dirty(), due to the fact that we can't call
clear_page_dirty_for_io() under subpage spin lock, we can race like
below:
T1 (eb1 in the same page) | T2 (eb2 in the same page)
-------------------------------+------------------------------
set_extent_buffer_dirty() | clear_extent_buffer_dirty()
|- was_dirty = false; | |- clear_subpagE_extent_buffer_dirty()
| | |- btrfs_clear_and_test_dirty()
| | | Since eb2 is the last dirty page
| | | we got:
| | | last == true;
| | |
|- btrfs_page_set_dirty() | |
| We set the page dirty and | |
| subpage dirty bitmap | |
| | |- if (last)
| | | Since we don't have subpage lock
| | | held, now @last is no longer
| | | correct
| | |- btree_clear_page_dirty()
| | Now PageDirty == false, even if
| | we have dirty_bitmap not zero.
|- ASSERT(PageDirty()); |
^^^^ CRASH
The solution here is to also lock the eb->pages[0] for subpage case of
set_extent_buffer_dirty(), to prevent racing with
clear_extent_buffer_dirty().
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
|
|
b8f957715e |
btrfs: support page uptodate assertions in subpage mode
There are quite some assert checks on page uptodate in extent buffer write accessors. They ensure the destination page is already uptodate. This is fine for regular sector size case, but not for subpage case, as for subpage we only mark the page uptodate if the page contains no hole and all its extent buffers are uptodate. So instead of checking PageUptodate(), for subpage case we check the uptodate bitmap of btrfs_subpage structure. To make the check more elegant, introduce a helper, assert_eb_page_uptodate() to do the check for both subpage and regular sector size cases. The following functions are involved: - write_extent_buffer_chunk_tree_uuid() - write_extent_buffer_fsid() - write_extent_buffer() - memzero_extent_buffer() - copy_extent_buffer() - extent_buffer_test_bit() - extent_buffer_bitmap_set() - extent_buffer_bitmap_clear() Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
|
|
|
1e5eb3d6a4 |
btrfs: make alloc_extent_buffer() check subpage dirty bitmap
In alloc_extent_buffer(), we make sure that the newly allocated page is never dirty. This is fine for sector size == PAGE_SIZE case, but for subpage it's possible that one extent buffer in the page is dirty, thus the whole page is marked dirty, and could cause false alert. To support subpage, call btrfs_page_test_dirty() to handle both cases. Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
|
|
|
eca0f6f643 |
btrfs: subpage: support metadata checksum calculation at write time
Add a new helper, csum_dirty_subpage_buffers(), to iterate through all dirty extent buffers in one bvec. Also extract the code of calculating csum for one extent buffer into csum_one_extent_buffer(), so that both the existing csum_dirty_buffer() and the new csum_dirty_subpage_buffers() can reuse the same routine. Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
|
|
|
139e8cd325 |
btrfs: subpage: do more sanity checks on metadata page dirtying
For btree_set_page_dirty(), we should also check the extent buffer sanity for subpage support. Unlike the regular sector size case, since one page can contain multiple extent buffers, we need to make sure there is at least one dirty extent buffer in the page. So this patch will iterate through the btrfs_subpage::dirty_bitmap to get the extent buffers, and check if any dirty extent buffer in the page range has EXTENT_BUFFER_DIRTY and proper refs. Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
|
|
|
3470da3b7d |
btrfs: subpage: introduce helpers for writeback status
Introduces the following functions to handle subpage writeback status: - btrfs_subpage_set_writeback() - btrfs_subpage_clear_writeback() - btrfs_subpage_test_writeback() These helpers can only be called when the range is ensured to be inside the page. - btrfs_page_set_writeback() - btrfs_page_clear_writeback() - btrfs_page_test_writeback() These helpers can handle both regular sector size and subpage without problem. Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
|
|
|
d8a5713e89 |
btrfs: subpage: introduce helpers for dirty status
Introduce the following functions to handle subpage dirty status: - btrfs_subpage_set_dirty() - btrfs_subpage_clear_dirty() - btrfs_subpage_test_dirty() These helpers can only be called when the range is ensured to be inside the page. - btrfs_page_set_dirty() - btrfs_page_clear_dirty() - btrfs_page_test_dirty() These helpers can handle both regular sector size and subpage without problem. Thus they would be used to replace PageDirty() related calls in later patches. There is one special point to note here, just like set_page_dirty() and clear_page_dirty_for_io(), btrfs_*page_set_dirty() and btrfs_*page_clear_dirty() must be called with page locked. Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
|
|
|
d239bcb83b |
btrfs: remove unnecessary variable shadowing in btrfs_invalidatepage()
In btrfs_invalidatepage() we re-declare @tree variable as btrfs_ordered_inode_tree. Since it's only used to do the spinlock, we can grab it from inode directly, and remove the unnecessary declaration completely. Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
|
|
|
ac5804eb85 |
btrfs: use min() to replace open-code in btrfs_invalidatepage()
In btrfs_invalidatepage() we introduce a temporary variable, new_len, to update ordered->truncated_len. But we can use min() to replace it completely and no need for the variable. Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
|
|
|
fc57ad8d33 |
btrfs: add sysfs interface for supported sectorsize
Export supported sector sizes in /sys/fs/btrfs/features/supported_sectorsizes. Currently all architectures have PAGE_SIZE, There's some disparity between read-only and read-write support but that will be unified in the future so there's only one file exporting the size. The read-only support for systems with 64K pages also works for 4K sector size. This new sysfs interface would help eg. mkfs.btrfs to print more accurate warnings about potentially incompatible option combinations. Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
|
|
|
ace75066ce |
btrfs: improve btree readahead for full send operations
Currently a full send operation uses the standard btree readahead when
iterating over the subvolume/snapshot btree, which despite bringing good
performance benefits, it could be improved in a few aspects for use cases
such as full send operations, which are guaranteed to visit every node
and leaf of a btree, in ascending and sequential order. The limitations
of that standard btree readahead implementation are the following:
1) It only triggers readahead for leaves that are physically close
to the leaf being read, within a 64K range;
2) It only triggers readahead for the next or previous leaves if the
leaf being read is not currently in memory;
3) It never triggers readahead for nodes.
So add a new readahead mode that addresses all these points and use it
for full send operations.
The following test script was used to measure the improvement on a box
using an average, consumer grade, spinning disk and with 16GiB of RAM:
$ cat test.sh
#!/bin/bash
DEV=/dev/sdj
MNT=/mnt/sdj
MKFS_OPTIONS="--nodesize 16384" # default, just to be explicit
MOUNT_OPTIONS="-o max_inline=2048" # default, just to be explicit
mkfs.btrfs -f $MKFS_OPTIONS $DEV > /dev/null
mount $MOUNT_OPTIONS $DEV $MNT
# Create files with inline data to make it easier and faster to create
# large btrees.
add_files()
{
local total=$1
local start_offset=$2
local number_jobs=$3
local total_per_job=$(($total / $number_jobs))
echo "Creating $total new files using $number_jobs jobs"
for ((n = 0; n < $number_jobs; n++)); do
(
local start_num=$(($start_offset + $n * $total_per_job))
for ((i = 1; i <= $total_per_job; i++)); do
local file_num=$((start_num + $i))
local file_path="$MNT/file_${file_num}"
xfs_io -f -c "pwrite -S 0xab 0 2000" $file_path > /dev/null
if [ $? -ne 0 ]; then
echo "Failed creating file $file_path"
break
fi
done
) &
worker_pids[$n]=$!
done
wait ${worker_pids[@]}
sync
echo
echo "btree node/leaf count: $(btrfs inspect-internal dump-tree -t 5 $DEV | egrep '^(node|leaf) ' | wc -l)"
}
initial_file_count=500000
add_files $initial_file_count 0 4
echo
echo "Creating first snapshot..."
btrfs subvolume snapshot -r $MNT $MNT/snap1
echo
echo "Adding more files..."
add_files $((initial_file_count / 4)) $initial_file_count 4
echo
echo "Updating 1/50th of the initial files..."
for ((i = 1; i < $initial_file_count; i += 50)); do
xfs_io -c "pwrite -S 0xcd 0 20" $MNT/file_$i > /dev/null
done
echo
echo "Creating second snapshot..."
btrfs subvolume snapshot -r $MNT $MNT/snap2
umount $MNT
echo 3 > /proc/sys/vm/drop_caches
blockdev --flushbufs $DEV &> /dev/null
hdparm -F $DEV &> /dev/null
mount $MOUNT_OPTIONS $DEV $MNT
echo
echo "Testing full send..."
start=$(date +%s)
btrfs send $MNT/snap1 > /dev/null
end=$(date +%s)
echo
echo "Full send took $((end - start)) seconds"
umount $MNT
The durations of the full send operation in seconds were the following:
Before this change: 217 seconds
After this change: 205 seconds (-5.7%)
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
|
|
eafa4fd0ad |
btrfs: fix exhaustion of the system chunk array due to concurrent allocations
When we are running out of space for updating the chunk tree, that is, when we are low on available space in the system space info, if we have many task concurrently allocating block groups, via fallocate for example, many of them can end up all allocating new system chunks when only one is needed. In extreme cases this can lead to exhaustion of the system chunk array, which has a size limit of 2048 bytes, and results in a transaction abort with errno EFBIG, producing a trace in dmesg like the following, which was triggered on a PowerPC machine with a node/leaf size of 64K: [1359.518899] ------------[ cut here ]------------ [1359.518980] BTRFS: Transaction aborted (error -27) [1359.519135] WARNING: CPU: 3 PID: 16463 at ../fs/btrfs/block-group.c:1968 btrfs_create_pending_block_groups+0x340/0x3c0 [btrfs] [1359.519152] Modules linked in: (...) [1359.519239] Supported: Yes, External [1359.519252] CPU: 3 PID: 16463 Comm: stress-ng Tainted: G X 5.3.18-47-default #1 SLE15-SP3 [1359.519274] NIP: c008000000e36fe8 LR: c008000000e36fe4 CTR: 00000000006de8e8 [1359.519293] REGS: c00000056890b700 TRAP: 0700 Tainted: G X (5.3.18-47-default) [1359.519317] MSR: 800000000282b033 <SF,VEC,VSX,EE,FP,ME,IR,DR,RI,LE> CR: 48008222 XER: 00000007 [1359.519356] CFAR: c00000000013e170 IRQMASK: 0 [1359.519356] GPR00: c008000000e36fe4 c00000056890b990 c008000000e83200 0000000000000026 [1359.519356] GPR04: 0000000000000000 0000000000000000 0000d52a3b027651 0000000000000007 [1359.519356] GPR08: 0000000000000003 0000000000000001 0000000000000007 0000000000000000 [1359.519356] GPR12: 0000000000008000 c00000063fe44600 000000001015e028 000000001015dfd0 [1359.519356] GPR16: 000000000000404f 0000000000000001 0000000000010000 0000dd1e287affff [1359.519356] GPR20: 0000000000000001 c000000637c9a000 ffffffffffffffe5 0000000000000000 [1359.519356] GPR24: 0000000000000004 0000000000000000 0000000000000100 ffffffffffffffc0 [1359.519356] GPR28: c000000637c9a000 c000000630e09230 c000000630e091d8 c000000562188b08 [1359.519561] NIP [c008000000e36fe8] btrfs_create_pending_block_groups+0x340/0x3c0 [btrfs] [1359.519613] LR [c008000000e36fe4] btrfs_create_pending_block_groups+0x33c/0x3c0 [btrfs] [1359.519626] Call Trace: [1359.519671] [c00000056890b990] [c008000000e36fe4] btrfs_create_pending_block_groups+0x33c/0x3c0 [btrfs] (unreliable) [1359.519729] [c00000056890ba90] [c008000000d68d44] __btrfs_end_transaction+0xbc/0x2f0 [btrfs] [1359.519782] [c00000056890bae0] [c008000000e309ac] btrfs_alloc_data_chunk_ondemand+0x154/0x610 [btrfs] [1359.519844] [c00000056890bba0] [c008000000d8a0fc] btrfs_fallocate+0xe4/0x10e0 [btrfs] [1359.519891] [c00000056890bd00] [c0000000004a23b4] vfs_fallocate+0x174/0x350 [1359.519929] [c00000056890bd50] [c0000000004a3cf8] ksys_fallocate+0x68/0xf0 [1359.519957] [c00000056890bda0] [c0000000004a3da8] sys_fallocate+0x28/0x40 [1359.519988] [c00000056890bdc0] [c000000000038968] system_call_exception+0xe8/0x170 [1359.520021] [c00000056890be20] [c00000000000cb70] system_call_common+0xf0/0x278 [1359.520037] Instruction dump: [1359.520049] 7d0049ad 40c2fff4 7c0004ac 71490004 40820024 2f83fffb 419e0048 3c620000 [1359.520082] e863bcb8 7ec4b378 48010d91 e8410018 <0fe00000> 3c820000 e884bcc8 7ec6b378 [1359.520122] ---[ end trace d6c186e151022e20 ]--- The following steps explain how we can end up in this situation: 1) Task A is at check_system_chunk(), either because it is allocating a new data or metadata block group, at btrfs_chunk_alloc(), or because it is removing a block group or turning a block group RO. It does not matter why; 2) Task A sees that there is not enough free space in the system space_info object, that is 'left' is < 'thresh'. And at this point the system space_info has a value of 0 for its 'bytes_may_use' counter; 3) As a consequence task A calls btrfs_alloc_chunk() in order to allocate a new system block group (chunk) and then reserves 'thresh' bytes in the chunk block reserve with the call to btrfs_block_rsv_add(). This changes the chunk block reserve's 'reserved' and 'size' counters by an amount of 'thresh', and changes the 'bytes_may_use' counter of the system space_info object from 0 to 'thresh'. Also during its call to btrfs_alloc_chunk(), we end up increasing the value of the 'total_bytes' counter of the system space_info object by 8MiB (the size of a system chunk stripe). This happens through the call chain: btrfs_alloc_chunk() create_chunk() btrfs_make_block_group() btrfs_update_space_info() 4) After it finishes the first phase of the block group allocation, at btrfs_chunk_alloc(), task A unlocks the chunk mutex; 5) At this point the new system block group was added to the transaction handle's list of new block groups, but its block group item, device items and chunk item were not yet inserted in the extent, device and chunk trees, respectively. That only happens later when we call btrfs_finish_chunk_alloc() through a call to btrfs_create_pending_block_groups(); Note that only when we update the chunk tree, through the call to btrfs_finish_chunk_alloc(), we decrement the 'reserved' counter of the chunk block reserve as we COW/allocate extent buffers, through: btrfs_alloc_tree_block() btrfs_use_block_rsv() btrfs_block_rsv_use_bytes() And the system space_info's 'bytes_may_use' is decremented everytime we allocate an extent buffer for COW operations on the chunk tree, through: btrfs_alloc_tree_block() btrfs_reserve_extent() find_free_extent() btrfs_add_reserved_bytes() If we end up COWing less chunk btree nodes/leaves than expected, which is the typical case since the amount of space we reserve is always pessimistic to account for the worst possible case, we release the unused space through: btrfs_create_pending_block_groups() btrfs_trans_release_chunk_metadata() btrfs_block_rsv_release() block_rsv_release_bytes() btrfs_space_info_free_bytes_may_use() But before task A gets into btrfs_create_pending_block_groups()... 6) Many other tasks start allocating new block groups through fallocate, each one does the first phase of block group allocation in a serialized way, since btrfs_chunk_alloc() takes the chunk mutex before calling check_system_chunk() and btrfs_alloc_chunk(). However before everyone enters the final phase of the block group allocation, that is, before calling btrfs_create_pending_block_groups(), new tasks keep coming to allocate new block groups and while at check_system_chunk(), the system space_info's 'bytes_may_use' keeps increasing each time a task reserves space in the chunk block reserve. This means that eventually some other task can end up not seeing enough free space in the system space_info and decide to allocate yet another system chunk. This may repeat several times if yet more new tasks keep allocating new block groups before task A, and all the other tasks, finish the creation of the pending block groups, which is when reserved space in excess is released. Eventually this can result in exhaustion of system chunk array in the superblock, with btrfs_add_system_chunk() returning EFBIG, resulting later in a transaction abort. Even when we don't reach the extreme case of exhausting the system array, most, if not all, unnecessarily created system block groups end up being unused since when finishing creation of the first pending system block group, the creation of the following ones end up not needing to COW nodes/leaves of the chunk tree, so we never allocate and deallocate from them, resulting in them never being added to the list of unused block groups - as a consequence they don't get deleted by the cleaner kthread - the only exceptions are if we unmount and mount the filesystem again, which adds any unused block groups to the list of unused block groups, if a scrub is run, which also adds unused block groups to the unused list, and under some circumstances when using a zoned filesystem or async discard, which may also add unused block groups to the unused list. So fix this by: *) Tracking the number of reserved bytes for the chunk tree per transaction, which is the sum of reserved chunk bytes by each transaction handle currently being used; *) When there is not enough free space in the system space_info, if there are other transaction handles which reserved chunk space, wait for some of them to complete in order to have enough excess reserved space released, and then try again. Otherwise proceed with the creation of a new system chunk. Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
|
|
|
b7a7a83463 |
btrfs: make reflinks respect O_SYNC O_DSYNC and S_SYNC flags
If we reflink to or from a file opened with O_SYNC/O_DSYNC or to/from a file that has the S_SYNC attribute set, we totally ignore that and do not durably persist the reflink changes. Since a reflink can change the data readable from a file (and mtime/ctime, or a file size), it makes sense to durably persist (fsync) the source and destination files/ranges. This was previously discussed at: https://lore.kernel.org/linux-btrfs/20200903035225.GJ6090@magnolia/ The recently introduced test case generic/628, from fstests, exercises these scenarios and currently fails without this change. So make sure we fsync the source and destination files/ranges when either of them was opened with O_SYNC/O_DSYNC or has the S_SYNC attribute set, just like XFS already does. Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
|
|
|
bb05b298af |
btrfs: zoned: bail out in btrfs_alloc_chunk for bad input
gcc complains that the ctl->max_chunk_size member might be used
uninitialized when none of the three conditions for initializing it in
init_alloc_chunk_ctl_policy_zoned() are true:
In function ‘init_alloc_chunk_ctl_policy_zoned’,
inlined from ‘init_alloc_chunk_ctl’ at fs/btrfs/volumes.c:5023:3,
inlined from ‘btrfs_alloc_chunk’ at fs/btrfs/volumes.c:5340:2:
include/linux/compiler-gcc.h:48:45: error: ‘ctl.max_chunk_size’ may be used uninitialized [-Werror=maybe-uninitialized]
4998 | ctl->max_chunk_size = min(limit, ctl->max_chunk_size);
| ^~~
fs/btrfs/volumes.c: In function ‘btrfs_alloc_chunk’:
fs/btrfs/volumes.c:5316:32: note: ‘ctl’ declared here
5316 | struct alloc_chunk_ctl ctl;
| ^~~
If we ever get into this condition, something is seriously
wrong, as validity is checked in the callers
btrfs_alloc_chunk
init_alloc_chunk_ctl
init_alloc_chunk_ctl_policy_zoned
so the same logic as in init_alloc_chunk_ctl_policy_regular()
and a few other places should be applied. This avoids both further
data corruption, and the compile-time warning.
Fixes:
|
|
|
|
3227788cd3 |
btrfs: fix a potential hole punching failure
In commit |
|
|
|
e75f9fd194 |
btrfs: zoned: move log tree node allocation out of log_root_tree->log_mutex
Commit |
|
|
|
2cdb3909c9 |
btrfs: use percpu_read_positive instead of sum_positive for need_preempt
Looking at perf data for a fio workload I noticed that we were spending
a pretty large chunk of time (around 5%) doing percpu_counter_sum() in
need_preemptive_reclaim. This is silly, as we only want to know if we
have more ordered than delalloc to see if we should be counting the
delayed items in our threshold calculation. Change this to
percpu_read_positive() to avoid the overhead.
I ran this through fsperf to validate the changes, obviously the latency
numbers in dbench and fio are quite jittery, so take them as you wish,
but overall the improvements on throughput, iops, and bw are all
positive. Each test was run two times, the given value is the average
of both runs for their respective column.
btrfs ssd normal test results
bufferedrandwrite16g results
metric baseline current diff
==========================================================
write_io_kbytes 16777216 16777216 0.00%
read_clat_ns_p99 0 0 0.00%
write_bw_bytes 1.04e+08 1.05e+08 1.12%
read_iops 0 0 0.00%
write_clat_ns_p50 13888 11840 -14.75%
read_io_kbytes 0 0 0.00%
read_io_bytes 0 0 0.00%
write_clat_ns_p99 35008 29312 -16.27%
read_bw_bytes 0 0 0.00%
elapsed 170 167 -1.76%
write_lat_ns_min 4221.50 3762.50 -10.87%
sys_cpu 39.65 35.37 -10.79%
write_lat_ns_max 2.67e+10 2.50e+10 -6.63%
read_lat_ns_min 0 0 0.00%
write_iops 25270.10 25553.43 1.12%
read_lat_ns_max 0 0 0.00%
read_clat_ns_p50 0 0 0.00%
dbench60 results
metric baseline current diff
==================================================
qpathinfo 11.12 12.73 14.52%
throughput 416.09 445.66 7.11%
flush 3485.63 1887.55 -45.85%
qfileinfo 0.70 1.92 173.86%
ntcreatex 992.60 695.76 -29.91%
qfsinfo 2.43 3.71 52.48%
close 1.67 3.14 88.09%
sfileinfo 66.54 105.20 58.10%
rename 809.23 619.59 -23.43%
find 16.88 15.46 -8.41%
unlink 820.54 670.86 -18.24%
writex 3375.20 2637.91 -21.84%
deltree 386.33 449.98 16.48%
readx 3.43 3.41 -0.60%
mkdir 0.05 0.03 -38.46%
lockx 0.26 0.26 -0.76%
unlockx 0.81 0.32 -60.33%
dio4kbs16threads results
metric baseline current diff
================================================================
write_io_kbytes 5249676 3357150 -36.05%
read_clat_ns_p99 0 0 0.00%
write_bw_bytes 89583501.50 57291192.50 -36.05%
read_iops 0 0 0.00%
write_clat_ns_p50 242688 263680 8.65%
read_io_kbytes 0 0 0.00%
read_io_bytes 0 0 0.00%
write_clat_ns_p99 15826944 36732928 132.09%
read_bw_bytes 0 0 0.00%
elapsed 61 61 0.00%
write_lat_ns_min 42704 42095 -1.43%
sys_cpu 5.27 3.45 -34.52%
write_lat_ns_max 7.43e+08 9.27e+08 24.71%
read_lat_ns_min 0 0 0.00%
write_iops 21870.97 13987.11 -36.05%
read_lat_ns_max 0 0 0.00%
read_clat_ns_p50 0 0 0.00%
randwrite2xram results
metric baseline current diff
================================================================
write_io_kbytes 24831972 28876262 16.29%
read_clat_ns_p99 0 0 0.00%
write_bw_bytes 83745273.50 92182192.50 10.07%
read_iops 0 0 0.00%
write_clat_ns_p50 13952 11648 -16.51%
read_io_kbytes 0 0 0.00%
read_io_bytes 0 0 0.00%
write_clat_ns_p99 50176 52992 5.61%
read_bw_bytes 0 0 0.00%
elapsed 314 332 5.73%
write_lat_ns_min 5920.50 5127 -13.40%
sys_cpu 7.82 7.35 -6.07%
write_lat_ns_max 5.27e+10 3.88e+10 -26.44%
read_lat_ns_min 0 0 0.00%
write_iops 20445.62 22505.42 10.07%
read_lat_ns_max 0 0 0.00%
read_clat_ns_p50 0 0 0.00%
untarfirefox results
metric baseline current diff
==============================================
elapsed 47.41 47.40 -0.03%
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
|
|
e2b84217f3 |
btrfs: update outdated comment at btrfs_replace_file_extents()
There is a comment at btrfs_replace_file_extents() that mentions that we
set the full sync flag on an inode when cloning into a file with a size
greater than or equals to 16MiB, through try_release_extent_mapping() when
we truncate the page cache after replacing file extents during a clone
operation.
That is not true anymore since commit
|
|
|
|
0c0218e9a6 |
btrfs: update outdated comment at btrfs_orphan_cleanup()
btrfs_orphan_cleanup() has a comment referring to find_dead_roots, but
function does not exists since commit
|
|
|
|
ffbc10a144 |
btrfs: update debug message when checking seq number of a delayed ref
We used to encode two different numbers in the tree mod log counter used
for sequence numbers, one in the upper 32 bits and the other one in the
lower 32 bits. However that is no longer the case, we stopped doing that
since commit
|
|
|
|
4bae788075 |
btrfs: add and use helper to get lowest sequence number for the tree mod log
There are two places outside the tree mod log module that extract the lowest sequence number of the tree mod log. These places end up duplicating code and open coding the logic and internal implementation details of the tree mod log. So add a helper to the tree mod log module and header that returns the lowest sequence number or 0 if there aren't any tree mod log users at the moment. Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
|
|
|
ffe1d039d7 |
btrfs: remove unnecessary leaf check at btrfs_tree_mod_log_free_eb()
At btrfs_tree_mod_log_free_eb() we check if we are dealing with a leaf, and if so, return immediately and do nothing. However this check can be removed, because after it we call tree_mod_need_log(), which returns false when given an extent buffer that corresponds to a leaf. So just remove the leaf check and pass the extent buffer to tree_mod_need_log(). Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
|
|
|
888dd18339 |
btrfs: use the new bit BTRFS_FS_TREE_MOD_LOG_USERS at btrfs_free_tree_block()
Instead of exposing implementation details of the tree mod log to check if there are active tree mod log users at btrfs_free_tree_block(), use the new bit BTRFS_FS_TREE_MOD_LOG_USERS for fs_info->flags instead. This way extent-tree.c does not need to known about any of the internals of the tree mod log and avoids taking a lock unnecessarily as well. Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
|
|
|
bc03f39ec3 |
btrfs: use a bit to track the existence of tree mod log users
The tree modification log functions are called very frequently, basically they are called every time a btree is modified (a pointer added or removed to a node, a new root for a btree is set, etc). Because of that, to avoid heavy lock contention on the lock that protects the list of tree mod log users, we have checks that test the emptiness of the list with a full memory barrier before the checks, so that when there are no tree mod log users we avoid taking the lock. Replace the memory barrier and list emptiness check with a test for a new bit set at fs_info->flags. This bit is used to indicate when there are tree mod log users, set whenever a user is added to the list and cleared when the last user is removed from the list. This makes the intention a bit more obvious and possibly more efficient (assuming test_bit() may be cheaper than a full memory barrier on some architectures). Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
|
|
|
406808ab2f |
btrfs: use booleans where appropriate for the tree mod log functions
Several functions of the tree modification log use integers as booleans, so change them to use booleans instead, making their use more clear. Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
|
|
|
f3a84ccd28 |
btrfs: move the tree mod log code into its own file
The tree modification log, which records modifications done to btrees, is quite large and currently spread all over ctree.c, which is a huge file already. To make things better organized, move all that code into its own separate source and header files. Functions and definitions that are used outside of the module (mostly by ctree.c) are renamed so that they start with a "btrfs_" prefix. Everything else remains unchanged. This makes it easier to go over the tree modification log code every time I need to go read it to fix a bug. Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> [ minor comment updates ] Signed-off-by: David Sterba <dsterba@suse.com> |
|
|
|
9a002d531b |
btrfs: integrity-checker: convert block context kmap's to kmap_local_page
btrfsic_read_block() (which calls kmap()) and btrfsic_release_block_ctx() (which calls kunmap()) are always called within a single thread of execution. Therefore the mappings created within these calls can be a thread local mapping. Convert the kmap() of bloc_ctx->pagev to kmap_local_page(). Luckily the unmap loops backwards through the array pointer so no adjustment needs to be made to the unmapping order. Signed-off-by: Ira Weiny <ira.weiny@intel.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
|
|
|
3e037efdbd |
btrfs: integrity-checker: use kmap_local_page in __btrfsic_submit_bio
Again there is an array of pointers which must be unmapped in the correct order. Convert the kmap()'s to kmap_local_page() and adjust the unmapping to work backwards through the unmapping loop. Signed-off-by: Ira Weiny <ira.weiny@intel.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
|
|
|
94a0b58d2d |
btrfs: raid56: convert kmaps to kmap_local_page
These kmaps are thread local and don't need to be atomic. So they can use the more efficient kmap_local_page(). However, the mapping of pages in the stripes and the additional parity and qstripe pages are a bit trickier because the unmapping must occur in the opposite order from the mapping. Furthermore, the pointer array in __raid_recover_end_io() may get reordered. Convert these calls to kmap_local_page() taking care to reverse the unmappings of any page arrays as well as being careful with the mappings of any special pages such as the parity and qstripe pages. Signed-off-by: Ira Weiny <ira.weiny@intel.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
|
|
|
58c1a35cd5 |
btrfs: convert kmap to kmap_local_page, simple cases
Use a simple coccinelle script to help convert the most common kmap()/kunmap() patterns to kmap_local_page()/kunmap_local(). Note that some kmaps which were caught by this script needed to be handled by hand because of the strict unmapping order of kunmap_local() so they are not included in this patch. But this script got us started. There's another temp variable added for the final length write to the first page so it does not interfere with cpage_out that is used for mapping other pages. The development of this patch was aided by the follow script: // <smpl> // SPDX-License-Identifier: GPL-2.0-only // Find kmap and replace with kmap_local_page then mark kunmap // // Confidence: Low // Copyright: (C) 2021 Intel Corporation // URL: http://coccinelle.lip6.fr/ @ catch_all @ expression e, e2; @@ ( -kmap(e) +kmap_local_page(e) ) ... ( -kunmap(...) +kunmap_local() ) // </smpl> Signed-off-by: Ira Weiny <ira.weiny@intel.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
|
|
|
cea628008f |
btrfs: remove duplicated in_range() macro
The in_range() macro is defined twice in btrfs' source, once in ctree.h and once in misc.h. Remove the definition in ctree.h and include misc.h in the files depending on it. Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: David Sterba <dsterba@suse.com> |
|
|
|
209ecbb858 |
btrfs: remove stale comment and logic from btrfs_inode_in_log()
Currently btrfs_inode_in_log() checks the list of modified extents of the inode, and has a comment mentioning why, as it used to be necessary to make sure if we did something like the following: mmap write range A mmap write range B msync range A (ranged fsync) msync range B (ranged fsync) we ended up with both ranges being logged. If we did not check it, then the second fsync would do nothing because btrfs_inode_in_log() would return true. This was added in |
|
|
|
bc0939fcfa |
btrfs: fix race between marking inode needs to be logged and log syncing
We have a race between marking that an inode needs to be logged, either
at btrfs_set_inode_last_trans() or at btrfs_page_mkwrite(), and between
btrfs_sync_log(). The following steps describe how the race happens.
1) We are at transaction N;
2) Inode I was previously fsynced in the current transaction so it has:
inode->logged_trans set to N;
3) The inode's root currently has:
root->log_transid set to 1
root->last_log_commit set to 0
Which means only one log transaction was committed to far, log
transaction 0. When a log tree is created we set ->log_transid and
->last_log_commit of its parent root to 0 (at btrfs_add_log_tree());
4) One more range of pages is dirtied in inode I;
5) Some task A starts an fsync against some other inode J (same root), and
so it joins log transaction 1.
Before task A calls btrfs_sync_log()...
6) Task B starts an fsync against inode I, which currently has the full
sync flag set, so it starts delalloc and waits for the ordered extent
to complete before calling btrfs_inode_in_log() at btrfs_sync_file();
7) During ordered extent completion we have btrfs_update_inode() called
against inode I, which in turn calls btrfs_set_inode_last_trans(),
which does the following:
spin_lock(&inode->lock);
inode->last_trans = trans->transaction->transid;
inode->last_sub_trans = inode->root->log_transid;
inode->last_log_commit = inode->root->last_log_commit;
spin_unlock(&inode->lock);
So ->last_trans is set to N and ->last_sub_trans set to 1.
But before setting ->last_log_commit...
8) Task A is at btrfs_sync_log():
- it increments root->log_transid to 2
- starts writeback for all log tree extent buffers
- waits for the writeback to complete
- writes the super blocks
- updates root->last_log_commit to 1
It's a lot of slow steps between updating root->log_transid and
root->last_log_commit;
9) The task doing the ordered extent completion, currently at
btrfs_set_inode_last_trans(), then finally runs:
inode->last_log_commit = inode->root->last_log_commit;
spin_unlock(&inode->lock);
Which results in inode->last_log_commit being set to 1.
The ordered extent completes;
10) Task B is resumed, and it calls btrfs_inode_in_log() which returns
true because we have all the following conditions met:
inode->logged_trans == N which matches fs_info->generation &&
inode->last_subtrans (1) <= inode->last_log_commit (1) &&
inode->last_subtrans (1) <= root->last_log_commit (1) &&
list inode->extent_tree.modified_extents is empty
And as a consequence we return without logging the inode, so the
existing logged version of the inode does not point to the extent
that was written after the previous fsync.
It should be impossible in practice for one task be able to do so much
progress in btrfs_sync_log() while another task is at
btrfs_set_inode_last_trans() right after it reads root->log_transid and
before it reads root->last_log_commit. Even if kernel preemption is enabled
we know the task at btrfs_set_inode_last_trans() can not be preempted
because it is holding the inode's spinlock.
However there is another place where we do the same without holding the
spinlock, which is in the memory mapped write path at:
vm_fault_t btrfs_page_mkwrite(struct vm_fault *vmf)
{
(...)
BTRFS_I(inode)->last_trans = fs_info->generation;
BTRFS_I(inode)->last_sub_trans = BTRFS_I(inode)->root->log_transid;
BTRFS_I(inode)->last_log_commit = BTRFS_I(inode)->root->last_log_commit;
(...)
So with preemption happening after setting ->last_sub_trans and before
setting ->last_log_commit, it is less of a stretch to have another task
do enough progress at btrfs_sync_log() such that the task doing the memory
mapped write ends up with ->last_sub_trans and ->last_log_commit set to
the same value. It is still a big stretch to get there, as the task doing
btrfs_sync_log() has to start writeback, wait for its completion and write
the super blocks.
So fix this in two different ways:
1) For btrfs_set_inode_last_trans(), simply set ->last_log_commit to the
value of ->last_sub_trans minus 1;
2) For btrfs_page_mkwrite() only set the inode's ->last_sub_trans, just
like we do for buffered and direct writes at btrfs_file_write_iter(),
which is all we need to make sure multiple writes and fsyncs to an
inode in the same transaction never result in an fsync missing that
the inode changed and needs to be logged. Turn this into a helper
function and use it both at btrfs_page_mkwrite() and at
btrfs_file_write_iter() - this also fixes the problem that at
btrfs_page_mkwrite() we were setting those fields without the
protection of the inode's spinlock.
This is an extremely unlikely race to happen in practice.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
|
|
885f46d87f |
btrfs: fix race between memory mapped writes and fsync
When doing an fsync we flush all delalloc, lock the inode (VFS lock), flush
any new delalloc that might have been created before taking the lock and
then wait either for the ordered extents to complete or just for the
writeback to complete (depending on whether the full sync flag is set or
not). We then start logging the inode and assume that while we are doing it
no one else is touching the inode's file extent items (or adding new ones).
That is generally true because all operations that modify an inode acquire
the inode's lock first, including buffered and direct IO writes. However
there is one exception: memory mapped writes, which do not and can not
acquire the inode's lock.
This can cause two types of issues: ending up logging file extent items
with overlapping ranges, which is detected by the tree checker and will
result in aborting the transaction when starting writeback for a log
tree's extent buffers, or a silent corruption where we log a version of
the file that never existed.
Scenario 1 - logging overlapping extents
The following steps explain how we can end up with file extents items with
overlapping ranges in a log tree due to a race between a fsync and memory
mapped writes:
1) Task A starts an fsync on inode X, which has the full sync runtime flag
set. First it starts by flushing all delalloc for the inode;
2) Task A then locks the inode and flushes any other delalloc that might
have been created after the previous flush and waits for all ordered
extents to complete;
3) In the inode's root we have the following leaf:
Leaf N, generation == current transaction id:
---------------------------------------------------------
| (...) [ file extent item, offset 640K, length 128K ] |
---------------------------------------------------------
The last file extent item in leaf N covers the file range from 640K to
768K;
4) Task B does a memory mapped write for the page corresponding to the
file range from 764K to 768K;
5) Task A starts logging the inode. At copy_inode_items_to_log() it uses
btrfs_search_forward() to search for leafs modified in the current
transaction that contain items for the inode. It finds leaf N and copies
all the inode items from that leaf into the log tree.
Now the log tree has a copy of the last file extent item from leaf N.
At the end of the while loop at copy_inode_items_to_log(), we have the
minimum key set to:
min_key.objectid = <inode X number>
min_key.type = BTRFS_EXTENT_DATA_KEY
min_key.offset = 640K
Then we increment the key's offset by 1 so that the next call to
btrfs_search_forward() leaves us at the first key greater than the key
we just processed.
But before btrfs_search_forward() is called again...
6) Dellaloc for the page at offset 764K, dirtied by task B, is started.
It can be started for several reasons:
- The async reclaim task is attempting to satisfy metadata or data
reservation requests, and it has reached a point where it decided
to flush delalloc;
- Due to memory pressure the VMM triggers writeback of dirty pages;
- The system call sync_file_range(2) is called from user space.
7) When the respective ordered extent completes, it trims the length of
the existing file extent item for file offset 640K from 128K to 124K,
and a new file extent item is added with a key offset of 764K and a
length of 4K;
8) Task A calls btrfs_search_forward(), which returns us a path pointing
to the leaf (can be leaf N or some other) containing the new file extent
item for file offset 764K.
We end up copying this item to the log tree, which overlaps with the
last copied file extent item, which covers the file range from 640K to
768K.
When writeback is triggered for log tree's extent buffers, the issue
will be detected by the tree checker which will dump a trace and an
error message on dmesg/syslog. If the writeback is triggered when
syncing the log, which typically is, then we also end up aborting the
current transaction.
This is the same type of problem fixed in
|
|
|
|
8d9b4a162a |
btrfs: exclude mmap from happening during all fallocate operations
There's a small window where a deadlock can happen between fallocate and
mmap. This is described in detail by Filipe:
"""
When doing a fallocate operation we lock the inode, flush delalloc within
the target range, wait for any ordered extents to complete and then lock
the file range. Before we lock the range and after we flush delalloc,
there is a time window where another task can come in and do a memory
mapped write for a page within the fallocate range.
This means that after fallocate locks the range, there can be a dirty page
in the range. More often than not, this does not cause any problem.
The exception is when we are low on available metadata space, because an
fallocate operation needs to start a transaction while holding the file
range locked, either through btrfs_prealloc_file_range() or through the
call to btrfs_fallocate_update_isize(). If that's the case, we can end up
in a deadlock. The following list of steps explains how that happens:
1) A fallocate operation starts, locks the inode, flushes delalloc in the
range and waits for ordered extents in the range to complete;
2) Before the fallocate task locks the file range, another task does a
memory mapped write for a page in the fallocate target range. This is
possible since memory mapped writes do not (and can not) lock the
inode;
3) The fallocate task locks the file range. At this point there is one
dirty page in the range (due to the memory mapped write);
4) When the fallocate task attempts to start a transaction, it blocks when
attempting to reserve metadata space, since we are low on available
metadata space. Before blocking (wait on its reservation ticket), it
starts the async reclaim task (if not running already);
5) The async reclaim task is not able to release space through any other
means, so it decides to flush delalloc for inodes with dirty pages.
It finds that the inode used in the fallocate operation has a dirty
page and therefore queues a job (fs_info->flush_workers workqueue) to
flush delalloc for that inode and waits on that job to complete;
6) The flush job blocks when attempting to lock the file range because
it is currently locked by the fallocate task;
7) The fallocate task keeps waiting for its metadata reservation, waiting
for a wakeup on its reservation ticket. The async reclaim task is
waiting on the flush job, which in turn is waiting for locking the file
range that is currently locked by the fallocate task. So unless some
other task is able to release enough metadata space, for example an
ordered extent for some other inode completes, we end up in a deadlock
between all these tasks.
When this happens stack traces like the following show up in dmesg/syslog:
INFO: task kworker/u16:11:1810830 blocked for more than 120 seconds.
Tainted: G B W 5.10.0-rc4-btrfs-next-73 #1
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
task:kworker/u16:11 state:D stack: 0 pid:1810830 ppid: 2 flags:0x00004000
Workqueue: btrfs-flush_delalloc btrfs_work_helper [btrfs]
Call Trace:
__schedule+0x5d1/0xcf0
schedule+0x45/0xe0
lock_extent_bits+0x1e6/0x2d0 [btrfs]
? finish_wait+0x90/0x90
btrfs_invalidatepage+0x32c/0x390 [btrfs]
? __mod_memcg_state+0x8e/0x160
__extent_writepage+0x2d4/0x400 [btrfs]
extent_write_cache_pages+0x2b2/0x500 [btrfs]
? lock_release+0x20e/0x4c0
? trace_hardirqs_on+0x1b/0xf0
extent_writepages+0x43/0x90 [btrfs]
? lock_acquire+0x1a3/0x490
do_writepages+0x43/0xe0
? __filemap_fdatawrite_range+0xa4/0x100
__filemap_fdatawrite_range+0xc5/0x100
btrfs_run_delalloc_work+0x17/0x40 [btrfs]
btrfs_work_helper+0xf1/0x600 [btrfs]
process_one_work+0x24e/0x5e0
worker_thread+0x50/0x3b0
? process_one_work+0x5e0/0x5e0
kthread+0x153/0x170
? kthread_mod_delayed_work+0xc0/0xc0
ret_from_fork+0x22/0x30
INFO: task kworker/u16:1:2426217 blocked for more than 120 seconds.
Tainted: G B W 5.10.0-rc4-btrfs-next-73 #1
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
task:kworker/u16:1 state:D stack: 0 pid:2426217 ppid: 2 flags:0x00004000
Workqueue: events_unbound btrfs_async_reclaim_metadata_space [btrfs]
Call Trace:
__schedule+0x5d1/0xcf0
? kvm_clock_read+0x14/0x30
? wait_for_completion+0x81/0x110
schedule+0x45/0xe0
schedule_timeout+0x30c/0x580
? _raw_spin_unlock_irqrestore+0x3c/0x60
? lock_acquire+0x1a3/0x490
? try_to_wake_up+0x7a/0xa20
? lock_release+0x20e/0x4c0
? lock_acquired+0x199/0x490
? wait_for_completion+0x81/0x110
wait_for_completion+0xab/0x110
start_delalloc_inodes+0x2af/0x390 [btrfs]
btrfs_start_delalloc_roots+0x12d/0x250 [btrfs]
flush_space+0x24f/0x660 [btrfs]
btrfs_async_reclaim_metadata_space+0x1bb/0x480 [btrfs]
process_one_work+0x24e/0x5e0
worker_thread+0x20f/0x3b0
? process_one_work+0x5e0/0x5e0
kthread+0x153/0x170
? kthread_mod_delayed_work+0xc0/0xc0
ret_from_fork+0x22/0x30
(...)
several tasks waiting for the inode lock held by the fallocate task below
(...)
RIP: 0033:0x7f61efe73fff
Code: Unable to access opcode bytes at RIP 0x7f61efe73fd5.
RSP: 002b:00007ffc3371bbe8 EFLAGS: 00000202 ORIG_RAX: 000000000000013c
RAX: ffffffffffffffda RBX: 00007ffc3371bea0 RCX: 00007f61efe73fff
RDX: 00000000ffffff9c RSI: 0000560fbd5d90a0 RDI: 00000000ffffff9c
RBP: 00007ffc3371beb0 R08: 0000000000000001 R09: 0000000000000003
R10: 0000560fbd5d7ad0 R11: 0000000000000202 R12: 0000000000000001
R13: 000000000000005e R14: 00007ffc3371bea0 R15: 00007ffc3371beb0
task:fdm-stress state:D stack: 0 pid:2508243 ppid:2508153 flags:0x00000000
Call Trace:
__schedule+0x5d1/0xcf0
? _raw_spin_unlock_irqrestore+0x3c/0x60
schedule+0x45/0xe0
__reserve_bytes+0x4a4/0xb10 [btrfs]
? finish_wait+0x90/0x90
btrfs_reserve_metadata_bytes+0x29/0x190 [btrfs]
btrfs_block_rsv_add+0x1f/0x50 [btrfs]
start_transaction+0x2d1/0x760 [btrfs]
btrfs_replace_file_extents+0x120/0x930 [btrfs]
? btrfs_fallocate+0xdcf/0x1260 [btrfs]
btrfs_fallocate+0xdfb/0x1260 [btrfs]
? filename_lookup+0xf1/0x180
vfs_fallocate+0x14f/0x440
ioctl_preallocate+0x92/0xc0
do_vfs_ioctl+0x66b/0x750
? __do_sys_newfstat+0x53/0x60
__x64_sys_ioctl+0x62/0xb0
do_syscall_64+0x33/0x80
entry_SYSCALL_64_after_hwframe+0x44/0xa9
"""
Fix this by disallowing mmaps from happening while we're doing any of
the fallocate operations on this inode.
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
|
|
8c99516a8c |
btrfs: exclude mmaps while doing remap
Darrick reported a potential issue to me where we could allow mmap
writes after validating a page range matched in the case of dedupe.
Generally we rely on lock page -> lock extent with the ordered flush to
protect us, but this is done after we check the pages because we use the
generic helpers, so we could modify the page in between doing the check
and locking the range.
There also exists a deadlock, as described by Filipe
"""
When cloning a file range, we lock the inodes, flush any delalloc within
the respective file ranges, wait for any ordered extents and then lock the
file ranges in both inodes. This means that right after we flush delalloc
and before we lock the file ranges, memory mapped writes can come in and
dirty pages in the file ranges of the clone operation.
Most of the time this is harmless and causes no problems. However, if we
are low on available metadata space, we can later end up in a deadlock
when starting a transaction to replace file extent items. This happens if
when allocating metadata space for the transaction, we need to wait for
the async reclaim thread to release space and the reclaim thread needs to
flush delalloc for the inode that got the memory mapped write and has its
range locked by the clone task.
Basically what happens is the following:
1) A clone operation locks inodes A and B, flushes delalloc for both
inodes in the respective file ranges and waits for any ordered extents
in those ranges to complete;
2) Before the clone task locks the file ranges, another task does a
memory mapped write (which does not lock the inode) for one of the
inodes of the clone operation. So now we have a dirty page in one of
the ranges used by the clone operation;
3) The clone operation locks the file ranges for inodes A and B;
4) Later, when iterating over the file extents of inode A, the clone
task attempts to start a transaction. There's not enough available
free metadata space, so the async reclaim task is started (if not
running already) and we wait for someone to wake us up on our
reservation ticket;
5) The async reclaim task is not able to release space by any other
means and decides to flush delalloc for the inode of the clone
operation;
6) The workqueue job used to flush the inode blocks when starting
delalloc for the inode, since the file range is currently locked by
the clone task;
7) But the clone task is waiting on its reservation ticket and the async
reclaim task is waiting on the flush job to complete, which can't
progress since the clone task has the file range locked. So unless
some other task is able to release space, for example an ordered
extent for some other inode completes, we have a deadlock between all
these tasks;
When this happens stack traces like the following show up in dmesg/syslog:
INFO: task kworker/u16:11:1810830 blocked for more than 120 seconds.
Tainted: G B W 5.10.0-rc4-btrfs-next-73 #1
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
task:kworker/u16:11 state:D stack: 0 pid:1810830 ppid: 2 flags:0x00004000
Workqueue: btrfs-flush_delalloc btrfs_work_helper [btrfs]
Call Trace:
__schedule+0x5d1/0xcf0
schedule+0x45/0xe0
lock_extent_bits+0x1e6/0x2d0 [btrfs]
? finish_wait+0x90/0x90
btrfs_invalidatepage+0x32c/0x390 [btrfs]
? __mod_memcg_state+0x8e/0x160
__extent_writepage+0x2d4/0x400 [btrfs]
extent_write_cache_pages+0x2b2/0x500 [btrfs]
? lock_release+0x20e/0x4c0
? trace_hardirqs_on+0x1b/0xf0
extent_writepages+0x43/0x90 [btrfs]
? lock_acquire+0x1a3/0x490
do_writepages+0x43/0xe0
? __filemap_fdatawrite_range+0xa4/0x100
__filemap_fdatawrite_range+0xc5/0x100
btrfs_run_delalloc_work+0x17/0x40 [btrfs]
btrfs_work_helper+0xf1/0x600 [btrfs]
process_one_work+0x24e/0x5e0
worker_thread+0x50/0x3b0
? process_one_work+0x5e0/0x5e0
kthread+0x153/0x170
? kthread_mod_delayed_work+0xc0/0xc0
ret_from_fork+0x22/0x30
INFO: task kworker/u16:1:2426217 blocked for more than 120 seconds.
Tainted: G B W 5.10.0-rc4-btrfs-next-73 #1
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
task:kworker/u16:1 state:D stack: 0 pid:2426217 ppid: 2 flags:0x00004000
Workqueue: events_unbound btrfs_async_reclaim_metadata_space [btrfs]
Call Trace:
__schedule+0x5d1/0xcf0
? kvm_clock_read+0x14/0x30
? wait_for_completion+0x81/0x110
schedule+0x45/0xe0
schedule_timeout+0x30c/0x580
? _raw_spin_unlock_irqrestore+0x3c/0x60
? lock_acquire+0x1a3/0x490
? try_to_wake_up+0x7a/0xa20
? lock_release+0x20e/0x4c0
? lock_acquired+0x199/0x490
? wait_for_completion+0x81/0x110
wait_for_completion+0xab/0x110
start_delalloc_inodes+0x2af/0x390 [btrfs]
btrfs_start_delalloc_roots+0x12d/0x250 [btrfs]
flush_space+0x24f/0x660 [btrfs]
btrfs_async_reclaim_metadata_space+0x1bb/0x480 [btrfs]
process_one_work+0x24e/0x5e0
worker_thread+0x20f/0x3b0
? process_one_work+0x5e0/0x5e0
kthread+0x153/0x170
? kthread_mod_delayed_work+0xc0/0xc0
ret_from_fork+0x22/0x30
(...)
several other tasks blocked on inode locks held by the clone task below
(...)
RIP: 0033:0x7f61efe73fff
Code: Unable to access opcode bytes at RIP 0x7f61efe73fd5.
RSP: 002b:00007ffc3371bbe8 EFLAGS: 00000202 ORIG_RAX: 000000000000013c
RAX: ffffffffffffffda RBX: 00007ffc3371bea0 RCX: 00007f61efe73fff
RDX: 00000000ffffff9c RSI: 0000560fbd604690 RDI: 00000000ffffff9c
RBP: 00007ffc3371beb0 R08: 0000000000000002 R09: 0000560fbd5d75f0
R10: 0000560fbd5d81f0 R11: 0000000000000202 R12: 0000000000000002
R13: 000000000000000b R14: 00007ffc3371bea0 R15: 00007ffc3371beb0
task: fdm-stress state:D stack: 0 pid:2508234 ppid:2508153 flags:0x00004000
Call Trace:
__schedule+0x5d1/0xcf0
? _raw_spin_unlock_irqrestore+0x3c/0x60
schedule+0x45/0xe0
__reserve_bytes+0x4a4/0xb10 [btrfs]
? finish_wait+0x90/0x90
btrfs_reserve_metadata_bytes+0x29/0x190 [btrfs]
btrfs_block_rsv_add+0x1f/0x50 [btrfs]
start_transaction+0x2d1/0x760 [btrfs]
btrfs_replace_file_extents+0x120/0x930 [btrfs]
? lock_release+0x20e/0x4c0
btrfs_clone+0x3e4/0x7e0 [btrfs]
? btrfs_lookup_first_ordered_extent+0x8e/0x100 [btrfs]
btrfs_clone_files+0xf6/0x150 [btrfs]
btrfs_remap_file_range+0x324/0x3d0 [btrfs]
do_clone_file_range+0xd4/0x1f0
vfs_clone_file_range+0x4d/0x230
? lock_release+0x20e/0x4c0
ioctl_file_clone+0x8f/0xc0
do_vfs_ioctl+0x342/0x750
__x64_sys_ioctl+0x62/0xb0
do_syscall_64+0x33/0x80
entry_SYSCALL_64_after_hwframe+0x44/0xa9
"""
Fix both of these issues by excluding mmaps from happening we are doing
any sort of remap, which prevents this race completely.
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
|
|
64708539cd |
btrfs: use btrfs_inode_lock/btrfs_inode_unlock inode lock helpers
A few places we intermix btrfs_inode_lock with a inode_unlock, and some places we just use inode_lock/inode_unlock instead of btrfs_inode_lock. None of these places are using this incorrectly, but as we adjust some of these callers it would be nice to keep everything consistent, so convert everybody to use btrfs_inode_lock/btrfs_inode_unlock. Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
|
|
|
8318ba79ee |
btrfs: add a i_mmap_lock to our inode
We need to be able to exclude page_mkwrite from happening concurrently with certain operations. To facilitate this, add a i_mmap_lock to our inode, down_read() it in our mkwrite, and add a new ILOCK flag to indicate that we want to take the i_mmap_lock as well. I used pahole to check the size of the btrfs_inode, the sizes are as follows no lockdep: before: 1120 (3 per 4k page) after: 1160 (3 per 4k page) lockdep: before: 2072 (1 per 4k page) after: 2224 (1 per 4k page) We're slightly larger but it doesn't change how many objects we can fit per page. Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
|
|
|
5e295768a0 |
btrfs: remove mirror argument from btrfs_csum_verify_data()
The parameter mirror is not used and does not make sense for checksum verification of the given bio. Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
|
|
|
6e65ae7629 |
btrfs: remove force argument from run_delalloc_nocow()
force_cow can be calculated from inode and does not need to be passed as an argument. This simplifies run_delalloc_nocow() call from btrfs_run_delalloc_range() A new function, should_nocow() checks if the range should be NOCOWed or not. The function returns true iff either BTRFS_INODE_NODATA or BTRFS_INODE_PREALLOC, but is not a defrag extent. Tested-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
|
|
|
d6ade6894e |
btrfs: don't opencode extent_changeset_free
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
|
|
|
7000babdda |
btrfs: assign proper values to a bool variable in dev_extent_hole_check_zoned
Fix the following coccicheck warnings: ./fs/btrfs/volumes.c:1462:10-11: WARNING: return of 0/1 in function 'dev_extent_hole_check_zoned' with return type bool. Reported-by: Abaci Robot <abaci@linux.alibaba.com> Signed-off-by: Jiapeng Chong <jiapeng.chong@linux.alibaba.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
|
|
|
2ce73c6335 |
btrfs: add btree read ahead for incremental send operations
Currently we do not do btree read ahead when doing an incremental send,
however we know that we will read and process any node or leaf in the
send root that has a generation greater than the generation of the parent
root. So triggering read ahead for such nodes and leafs is beneficial
for an incremental send.
This change does that, triggers read ahead of any node or leaf in the
send root that has a generation greater then the generation of the
parent root. As for the parent root, no readahead is triggered because
knowing in advance which nodes/leaves are going to be read is not so
linear and there's often a large time window between visiting nodes or
leaves of the parent root. So I opted to leave out the parent root,
and triggering read ahead for its nodes/leaves seemed to have not made
significant difference.
The following test script was used to measure the improvement on a box
using an average, consumer grade, spinning disk and with 16GiB of ram:
$ cat test.sh
#!/bin/bash
DEV=/dev/sdj
MNT=/mnt/sdj
MKFS_OPTIONS="--nodesize 16384" # default, just to be explicit
MOUNT_OPTIONS="-o max_inline=2048" # default, just to be explicit
mkfs.btrfs -f $MKFS_OPTIONS $DEV > /dev/null
mount $MOUNT_OPTIONS $DEV $MNT
# Create files with inline data to make it easier and faster to create
# large btrees.
add_files()
{
local total=$1
local start_offset=$2
local number_jobs=$3
local total_per_job=$(($total / $number_jobs))
echo "Creating $total new files using $number_jobs jobs"
for ((n = 0; n < $number_jobs; n++)); do
(
local start_num=$(($start_offset + $n * $total_per_job))
for ((i = 1; i <= $total_per_job; i++)); do
local file_num=$((start_num + $i))
local file_path="$MNT/file_${file_num}"
xfs_io -f -c "pwrite -S 0xab 0 2000" $file_path > /dev/null
if [ $? -ne 0 ]; then
echo "Failed creating file $file_path"
break
fi
done
) &
worker_pids[$n]=$!
done
wait ${worker_pids[@]}
sync
echo
echo "btree node/leaf count: $(btrfs inspect-internal dump-tree -t 5 $DEV | egrep '^(node|leaf) ' | wc -l)"
}
initial_file_count=500000
add_files $initial_file_count 0 4
echo
echo "Creating first snapshot..."
btrfs subvolume snapshot -r $MNT $MNT/snap1
echo
echo "Adding more files..."
add_files $((initial_file_count / 4)) $initial_file_count 4
echo
echo "Updating 1/50th of the initial files..."
for ((i = 1; i < $initial_file_count; i += 50)); do
xfs_io -c "pwrite -S 0xcd 0 20" $MNT/file_$i > /dev/null
done
echo
echo "Creating second snapshot..."
btrfs subvolume snapshot -r $MNT $MNT/snap2
umount $MNT
echo 3 > /proc/sys/vm/drop_caches
blockdev --flushbufs $DEV &> /dev/null
hdparm -F $DEV &> /dev/null
mount $MOUNT_OPTIONS $DEV $MNT
echo
echo "Testing full send..."
start=$(date +%s)
btrfs send $MNT/snap1 > /dev/null
end=$(date +%s)
echo
echo "Full send took $((end - start)) seconds"
umount $MNT
echo 3 > /proc/sys/vm/drop_caches
blockdev --flushbufs $DEV &> /dev/null
hdparm -F $DEV &> /dev/null
mount $MOUNT_OPTIONS $DEV $MNT
echo
echo "Testing incremental send..."
start=$(date +%s)
btrfs send -p $MNT/snap1 $MNT/snap2 > /dev/null
end=$(date +%s)
echo
echo "Incremental send took $((end - start)) seconds"
umount $MNT
Before this change, incremental send duration:
with $initial_file_count == 200000: 51 seconds
with $initial_file_count == 500000: 168 seconds
After this change, incremental send duration:
with $initial_file_count == 200000: 39 seconds (-26.7%)
with $initial_file_count == 500000: 125 seconds (-29.4%)
For $initial_file_count == 200000 there are 62600 nodes and leaves in the
btree of the first snapshot, and 77759 nodes and leaves in the btree of
the second snapshot. The root nodes were at level 2.
While for $initial_file_count == 500000 there are 152476 nodes and leaves
in the btree of the first snapshot, and 190511 nodes and leaves in the
btree of the second snapshot. The root nodes were at level 2 as well.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
|
|
19358b154f |
btrfs: add btree read ahead for full send operations
When doing a full send we know that we are going to be reading every node
and leaf of the send root, so we benefit from enabling read ahead for the
btree.
This change enables read ahead for full send operations only, incremental
sends will have read ahead enabled in a different way by a separate patch.
The following test script was used to measure the improvement on a box
using an average, consumer grade, spinning disk and with 16GiB of RAM:
$ cat test.sh
#!/bin/bash
DEV=/dev/sdj
MNT=/mnt/sdj
MKFS_OPTIONS="--nodesize 16384" # default, just to be explicit
MOUNT_OPTIONS="-o max_inline=2048" # default, just to be explicit
mkfs.btrfs -f $MKFS_OPTIONS $DEV > /dev/null
mount $MOUNT_OPTIONS $DEV $MNT
# Create files with inline data to make it easier and faster to create
# large btrees.
add_files()
{
local total=$1
local start_offset=$2
local number_jobs=$3
local total_per_job=$(($total / $number_jobs))
echo "Creating $total new files using $number_jobs jobs"
for ((n = 0; n < $number_jobs; n++)); do
(
local start_num=$(($start_offset + $n * $total_per_job))
for ((i = 1; i <= $total_per_job; i++)); do
local file_num=$((start_num + $i))
local file_path="$MNT/file_${file_num}"
xfs_io -f -c "pwrite -S 0xab 0 2000" $file_path > /dev/null
if [ $? -ne 0 ]; then
echo "Failed creating file $file_path"
break
fi
done
) &
worker_pids[$n]=$!
done
wait ${worker_pids[@]}
sync
echo
echo "btree node/leaf count: $(btrfs inspect-internal dump-tree -t 5 $DEV | egrep '^(node|leaf) ' | wc -l)"
}
initial_file_count=500000
add_files $initial_file_count 0 4
echo
echo "Creating first snapshot..."
btrfs subvolume snapshot -r $MNT $MNT/snap1
echo
echo "Adding more files..."
add_files $((initial_file_count / 4)) $initial_file_count 4
echo
echo "Updating 1/50th of the initial files..."
for ((i = 1; i < $initial_file_count; i += 50)); do
xfs_io -c "pwrite -S 0xcd 0 20" $MNT/file_$i > /dev/null
done
echo
echo "Creating second snapshot..."
btrfs subvolume snapshot -r $MNT $MNT/snap2
umount $MNT
echo 3 > /proc/sys/vm/drop_caches
blockdev --flushbufs $DEV &> /dev/null
hdparm -F $DEV &> /dev/null
mount $MOUNT_OPTIONS $DEV $MNT
echo
echo "Testing full send..."
start=$(date +%s)
btrfs send $MNT/snap1 > /dev/null
end=$(date +%s)
echo
echo "Full send took $((end - start)) seconds"
umount $MNT
echo 3 > /proc/sys/vm/drop_caches
blockdev --flushbufs $DEV &> /dev/null
hdparm -F $DEV &> /dev/null
mount $MOUNT_OPTIONS $DEV $MNT
echo
echo "Testing incremental send..."
start=$(date +%s)
btrfs send -p $MNT/snap1 $MNT/snap2 > /dev/null
end=$(date +%s)
echo
echo "Incremental send took $((end - start)) seconds"
umount $MNT
Before this change, full send duration:
with $initial_file_count == 200000: 165 seconds
with $initial_file_count == 500000: 407 seconds
After this change, full send duration:
with $initial_file_count == 200000: 149 seconds (-10.2%)
with $initial_file_count == 500000: 353 seconds (-14.2%)
For $initial_file_count == 200000 there are 62600 nodes and leaves in the
btree of the first snapshot, while for $initial_file_count == 500000 there
are 152476 nodes and leaves. The roots were at level 2.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
|
|
98686ffc71 |
btrfs: simplify code flow in btrfs_delayed_inode_reserve_metadata
btrfs_block_rsv_add can return only ENOSPC since it's called with NO_FLUSH modifier. This so simplify the logic in btrfs_delayed_inode_reserve_metadata to exploit this invariant. Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> [ add assert and comment ] Signed-off-by: David Sterba <dsterba@suse.com> |
|
|
|
8e3c9d3cf8 |
btrfs: remove btrfs_inode parameter from btrfs_delayed_inode_reserve_metadata
It's only used for tracepoint to obtain the inode number, but we already have the ino from btrfs_delayed_node::inode_id. Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
|
|
|
ae396a3b7a |
btrfs: simplify commit logic in try_flush_qgroup
It's no longer expected to call this function with an open transaction so all the workarounds concerning this can be removed. In fact it'll constitute a bug to call this function with a transaction already held so WARN in this case. Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
|
|
|
e5ce988690 |
btrfs: scrub: drop a few function declarations
Drop function declarations at the beginning of the file scrub.c. These functions are defined before they are used in the same file and don't need forward declaration. No functional changes. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Anand Jain <anand.jain@oracle.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
|
|
|
f4639636b6 |
btrfs: change return type to bool in btrfs_extent_readonly
btrfs_extent_readonly() checks if the block group is readonly, the bool return type should be used. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Anand Jain <anand.jain@oracle.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
|
|
|
05947ae186 |
btrfs: unexport btrfs_extent_readonly() and make it static
btrfs_extent_readonly() is used by can_nocow_extent() in inode.c. So move it from extent-tree.c to inode.c and declare it as static. Signed-off-by: Anand Jain <anand.jain@oracle.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
|
|
|
b6e9f16c5f |
btrfs: replace open coded while loop with proper construct
btrfs_inc_block_group_ro wants to ensure that the current transaction is
not running dirty block groups, if it is it waits and loops again.
That logic is currently implemented using a goto label. Actually using
a proper do {} while() construct doesn't hurt readability nor does it
introduce excessive nesting and makes the relevant code stand out by
being encompassed in the loop construct. No functional changes.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
|
|
20bbf20e95 |
btrfs: replace offset_in_entry with in_range
No point in duplicating the functionality just use the generic helper that has the same semantics. Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
|
|
|
cca5de97ae |
btrfs: make find_desired_extent take btrfs_inode
Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
|
|
|
bfc78479eb |
btrfs: make btrfs_replace_file_extents take btrfs_inode
Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
|
|
|
0b3dcd131d |
btrfs: fix comment for btrfs ordered extent flag bits
There is small error in comment about BTRFS_ORDERED_* flags, added in
commit
|
|
|
|
97fc297754 |
btrfs: convert to fileattr
Use the fileattr API to let the VFS handle locking, permission checking and conversion. Signed-off-by: Miklos Szeredi <mszeredi@redhat.com> Cc: David Sterba <dsterba@suse.com> |
|
|
|
7d90072491 |
for-5.12-rc6-tag
-----BEGIN PGP SIGNATURE----- iQIzBAABCgAdFiEE8rQSAMVO+zA4DBdWxWXV+ddtWDsFAmBy9DoACgkQxWXV+ddt WDtqdxAAnK4zx79k5ok6nlj8JlOfReimX4wPYYigiiKGY40cfQUZ1YUqbDscvrt+ cbzvqJuMU/V/UVaPW/CLmNi5XpNlSmj0229iwy59BIcpXfgtAMTsa1zsY4teZ/AT 3noNuT15CTeybwii0nT++AkqJbCbwXc5ItccGh9ZMOQwXuA5IUVTAzKrulUJoxXN zt23lX/ivtSfUH+pMMIG6wMVG2eGIP5m9drw+2n0yK08gt+oprLYnaAaE389mXgb TIRBafeBY7UA1YEcA4JDBDMNa0L8yWSV+XiMhxw7Ear7KoROAunKNbsG8USll6zb zBftfO+Gzv86wVvvPXg2KR8Qs9vyJMw2bOROFKzOnd+wQQ76v0XefOhNUUN98E6g tLTmCH+M1B1Qm1j2hVyOect/PMY51xqJA9xwlTtAbqIcz4qyOtfTR9KqqlWxVKJW 9pAEMII063xEKVxgv2khOhewEjOgqa4v9YFQjVXdcHPKvGTAYBeoJA735+WnQ1HZ okPC5k3DoEcVZEkUPvespEsAqm+RoBufNxWmQ7hq5N3IwZAXsIwTlhysgrXQWyc9 aTigWBq6rQ/bMz/57vI626+MAMh3StL+UOxlWiT+GToInpjZwoxZ0lgQdD6vUfUm T90T2930+PTkykQM9sNdQygGiH0J5FzkvneYvpkOYJ/+vphsRiA= =MuRt -----END PGP SIGNATURE----- Merge tag 'for-5.12-rc6-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux Pull btrfs fix from David Sterba: "One more patch that we'd like to get to 5.12 before release. It's changing where and how the superblock is stored in the zoned mode. It is an on-disk format change but so far there are no implications for users as the proper mkfs support hasn't been merged and is waiting for the kernel side to settle. Until now, the superblocks were derived from the zone index, but zone size can differ per device. This is changed to be based on fixed offset values, to make it independent of the device zone size. The work on that got a bit delayed, we discussed the exact locations to support potential device sizes and usecases. (Partially delayed also due to my vacation.) Having that in the same release where the zoned mode is declared usable is highly desired, there are userspace projects that need to be updated to recognize the feature. Pushing that to the next release would make things harder to test" * tag 'for-5.12-rc6-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux: btrfs: zoned: move superblock logging zone location |
|
|
|
53b74fa990 |
btrfs: zoned: move superblock logging zone location
Moves the location of the superblock logging zones. The new locations of the logging zones are now determined based on fixed block addresses instead of on fixed zone numbers. The old placement method based on fixed zone numbers causes problems when one needs to inspect a file system image without access to the drive zone information. In such case, the super block locations cannot be reliably determined as the zone size is unknown. By locating the superblock logging zones using fixed addresses, we can scan a dumped file system image without the zone information since a super block copy will always be present at or after the fixed known locations. Introduce the following three pairs of zones containing fixed offset locations, regardless of the device zone size. - primary superblock: offset 0B (and the following zone) - first copy: offset 512G (and the following zone) - Second copy: offset 4T (4096G, and the following zone) If a logging zone is outside of the disk capacity, we do not record the superblock copy. The first copy position is much larger than for a non-zoned filesystem, which is at 64M. This is to avoid overlapping with the log zones for the primary superblock. This higher location is arbitrary but allows supporting devices with very large zone sizes, plus some space around in between. Such large zone size is unrealistic and very unlikely to ever be seen in real devices. Currently, SMR disks have a zone size of 256MB, and we are expecting ZNS drives to be in the 1-4GB range, so this limit gives us room to breathe. For now, we only allow zone sizes up to 8GB. The maximum zone size that would still fit in the space is 256G. The fixed location addresses are somewhat arbitrary, with the intent of maintaining superblock reliability for smaller and larger devices, with the preference for the latter. For this reason, there are two superblocks under the first 1T. This should cover use cases for physical devices and for emulated/device-mapper devices. The superblock logging zones are reserved for superblock logging and never used for data or metadata blocks. Note that we only reserve the two zones per primary/copy actually used for superblock logging. We do not reserve the ranges of zones possibly containing superblocks with the largest supported zone size (0-16GB, 512G-528GB, 4096G-4112G). The zones containing the fixed location offsets used to store superblocks on a non-zoned volume are also reserved to avoid confusion. Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com> Signed-off-by: David Sterba <dsterba@suse.com> |
|
|
|
4f0f586bf0 |
treewide: Change list_sort to use const pointers
list_sort() internally casts the comparison function passed to it to a different type with constant struct list_head pointers, and uses this pointer to call the functions, which trips indirect call Control-Flow Integrity (CFI) checking. Instead of removing the consts, this change defines the list_cmp_func_t type and changes the comparison function types of all list_sort() callers to use const pointers, thus avoiding type mismatches. Suggested-by: Nick Desaulniers <ndesaulniers@google.com> Signed-off-by: Sami Tolvanen <samitolvanen@google.com> Reviewed-by: Nick Desaulniers <ndesaulniers@google.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Kees Cook <keescook@chromium.org> Tested-by: Nick Desaulniers <ndesaulniers@google.com> Tested-by: Nathan Chancellor <nathan@kernel.org> Signed-off-by: Kees Cook <keescook@chromium.org> Link: https://lore.kernel.org/r/20210408182843.1754385-10-samitolvanen@google.com |
|
|
|
701c09c988 |
for-5.12-rc4-tag
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEE8rQSAMVO+zA4DBdWxWXV+ddtWDsFAmBctBgACgkQxWXV+ddt
WDu1nA//bzuPwW3nO+enE+ipi4t6UJTJpHLeDgdMshWwhBIHVt+oFxTUIt4Zd0kT
0hJ+mbNrZHzmDmzpb6ifQn0D6k+wq6zbsEgLtwgmPmBszaXIw46FvnYnxd9FtCde
9SQzBKa86i/KMkRtaIvpUcunniIo5Aj0Hvu0oPgTKObqiB4HP2nV6rKody+mP9JW
RanWbBi0JvI4UE/J2Ud1sNWFdDtVpXpcktj1dsI8gbsYNR05HpM08SEUgeF/ts3I
yB/L18I5CUeFHyo/yogbj7kkikugPGsmOj/A86UZ6x3NxWoC+m7UXoGrO2/qlFem
qd3ioXZKlnPqeX29kAy/REa3xjE61istlDVC/vckqmXBfYc6WK/KAJvFAGI+/3VI
9HvIbBokUQzekhFlA02RTqGcasStXX7VSeJyzyAbXjGhZQKfFTHR8ZBtrREiVBC9
58K+g8SSqIb/9iJqYV4h82lSBRSdf9kHx7CSB2gOBuifihY+chVr4Xzhq12IlXbK
TNlue0BTwYLJStwx2dnY2beLbLG34/4FNRsuAR/9JsCio7Bfj0qN8htIyvfsiMxr
mkrH7+Ykd10FqC8uu6MHiW9k428871Era3B97TgyQ0V17ehh4IN0v9V7kckk9EWw
3omaPwuF2FGfFOoTR7ipKO0nDx0/y2knnDSTsWknNG09Ciwa+Ww=
=SuJv
-----END PGP SIGNATURE-----
Merge tag 'for-5.12-rc4-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux
Pull btrfs fixes from David Sterba:
"Fixes for issues that have some user visibility and are simple enough
for this time of development cycle:
- a few fixes for rescue= mount option, adding more checks for
missing trees
- fix sleeping in atomic context on qgroup deletion
- fix subvolume deletion on mount
- fix build with M= syntax
- fix checksum mismatch error message for direct io"
* tag 'for-5.12-rc4-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
btrfs: fix check_data_csum() error message for direct I/O
btrfs: fix sleep while in non-sleep context during qgroup removal
btrfs: fix subvolume/snapshot deletion not triggered on mount
btrfs: fix build when using M=fs/btrfs
btrfs: do not initialize dev replace for bad dev root
btrfs: initialize device::fs_info always
btrfs: do not initialize dev stats if we have no dev_root
btrfs: zoned: remove outdated WARN_ON in direct IO
|
|
|
|
81aa0968b7 |
for-5.12-rc3-tag
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEE8rQSAMVO+zA4DBdWxWXV+ddtWDsFAmBTeBsACgkQxWXV+ddt
WDtwcBAAoto5Pbc3Lvt0aha3qn9q/Ms9lNU3YIwTjqXV3lIRKksWCS7kQmWlFmLz
dILhdRBg1iWVh8qbeqpL5su7yNJduypsY/ImJroukb/BzwQViFRDGy5qIc56qLH2
OVTx4LQ0zdqVdD86Qj0mt9ilSjgXYN+J53IUjsSSyJIpgt3vVcfjCYSkFO8zBiMH
eliRtYShzJHkjEwVWLZRzk76oTnFQEC28IdYJ4y95mYl2wCABfTU2ylSeVDTtc6O
x+fNMHHRmde2nbsHc+0eMm7rYLXuzvyx/tY17u6A6iwEQLGjE4rXOVZ7kA93WgAd
YTXhM/B+YFfirNh029Av/MJP+2t9YBEODAHl1tnOdM0mfvXkpimaW0jvUEhi5f6I
ZGu5FytscsgjyUK827WL7bZKO8WMzTLQvB3ryZ9UcrHm3QbZ7xGdoBE2L86p4Euw
LiXUALdOWeYjFKSW9WWKrtQBtdjlLQYqJt+hL0ifaGlnfoi2G+DQeKtL9ZAKH5Cu
gcjDUewnJtYPLyDOCRjQPFcts/MD5o81qMLeEwshmZT/bNMD9JOGEppCxBWGWSCx
dYGq04Wib/dN710i5jB1XbJboBmT2SZDyBeiKTpCXs5mECBU00uWkkO98oId1YS3
wHu9qyGUOi2g88V27jH593/JstUYn6zyxJYIZX84mzcxOqZlKuo=
=auMP
-----END PGP SIGNATURE-----
Merge tag 'for-5.12-rc3-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux
Pull btrfs fixes from David Sterba:
"There are still regressions being found and fixed in the zoned mode
and subpage code, the rest are fixes for bugs reported by users.
Regressions:
- subpage block support:
- readahead works on the proper block size
- fix last page zeroing
- zoned mode:
- linked list corruption for tree log
Fixes:
- qgroup leak after falloc failure
- tree mod log and backref resolving:
- extent buffer cloning race when resolving backrefs
- pin deleted leaves with active tree mod log users
- drop debugging flag from slab cache"
* tag 'for-5.12-rc3-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
btrfs: always pin deleted leaves when there are active tree mod log users
btrfs: fix race when cloning extent buffer during rewind of an old root
btrfs: fix slab cache flags for free space tree bitmap
btrfs: subpage: make readahead work properly
btrfs: subpage: fix wild pointer access during metadata read failure
btrfs: zoned: fix linked list corruption after log root tree allocation failure
btrfs: fix qgroup data rsv leak caused by falloc failure
btrfs: track qgroup released data in own variable in insert_prealloc_file_extent
btrfs: fix wrong offset to zero out range beyond i_size
|
|
|
|
c1d6abdac4 |
btrfs: fix check_data_csum() error message for direct I/O
Commit 1dae796aabf6 ("btrfs: inode: sink parameter start and len to
check_data_csum()") replaced the start parameter to check_data_csum()
with page_offset(), but page_offset() is not meaningful for direct I/O
pages. Bring back the start parameter.
Fixes:
|
|
|
|
0bb7883009 |
btrfs: fix sleep while in non-sleep context during qgroup removal
While removing a qgroup's sysfs entry we end up taking the kernfs_mutex,
through kobject_del(), while holding the fs_info->qgroup_lock spinlock,
producing the following trace:
[821.843637] BUG: sleeping function called from invalid context at kernel/locking/mutex.c:281
[821.843641] in_atomic(): 1, irqs_disabled(): 0, non_block: 0, pid: 28214, name: podman
[821.843644] CPU: 3 PID: 28214 Comm: podman Tainted: G W 5.11.6 #15
[821.843646] Hardware name: Dell Inc. PowerEdge R330/084XW4, BIOS 2.11.0 12/08/2020
[821.843647] Call Trace:
[821.843650] dump_stack+0xa1/0xfb
[821.843656] ___might_sleep+0x144/0x160
[821.843659] mutex_lock+0x17/0x40
[821.843662] kernfs_remove_by_name_ns+0x1f/0x80
[821.843666] sysfs_remove_group+0x7d/0xe0
[821.843668] sysfs_remove_groups+0x28/0x40
[821.843670] kobject_del+0x2a/0x80
[821.843672] btrfs_sysfs_del_one_qgroup+0x2b/0x40 [btrfs]
[821.843685] __del_qgroup_rb+0x12/0x150 [btrfs]
[821.843696] btrfs_remove_qgroup+0x288/0x2a0 [btrfs]
[821.843707] btrfs_ioctl+0x3129/0x36a0 [btrfs]
[821.843717] ? __mod_lruvec_page_state+0x5e/0xb0
[821.843719] ? page_add_new_anon_rmap+0xbc/0x150
[821.843723] ? kfree+0x1b4/0x300
[821.843725] ? mntput_no_expire+0x55/0x330
[821.843728] __x64_sys_ioctl+0x5a/0xa0
[821.843731] do_syscall_64+0x33/0x70
[821.843733] entry_SYSCALL_64_after_hwframe+0x44/0xa9
[821.843736] RIP: 0033:0x4cd3fb
[821.843741] RSP: 002b:000000c000906b20 EFLAGS: 00000206 ORIG_RAX: 0000000000000010
[821.843744] RAX: ffffffffffffffda RBX: 000000c000050000 RCX: 00000000004cd3fb
[821.843745] RDX: 000000c000906b98 RSI: 000000004010942a RDI: 000000000000000f
[821.843747] RBP: 000000c000907cd0 R08: 000000c000622901 R09: 0000000000000000
[821.843748] R10: 000000c000d992c0 R11: 0000000000000206 R12: 000000000000012d
[821.843749] R13: 000000000000012c R14: 0000000000000200 R15: 0000000000000049
Fix this by removing the qgroup sysfs entry while not holding the spinlock,
since the spinlock is only meant for protection of the qgroup rbtree.
Reported-by: Stuart Shelton <srcshelton@gmail.com>
Link: https://lore.kernel.org/linux-btrfs/7A5485BB-0628-419D-A4D3-27B1AF47E25A@gmail.com/
Fixes:
|
|
|
|
8d488a8c7b |
btrfs: fix subvolume/snapshot deletion not triggered on mount
During the mount procedure we are calling btrfs_orphan_cleanup() against
the root tree, which will find all orphans items in this tree. When an
orphan item corresponds to a deleted subvolume/snapshot (instead of an
inode space cache), it must not delete the orphan item, because that will
cause btrfs_find_orphan_roots() to not find the orphan item and therefore
not add the corresponding subvolume root to the list of dead roots, which
results in the subvolume's tree never being deleted by the cleanup thread.
The same applies to the remount from RO to RW path.
Fix this by making btrfs_find_orphan_roots() run before calling
btrfs_orphan_cleanup() against the root tree.
A test case for fstests will follow soon.
Reported-by: Robbie Ko <robbieko@synology.com>
Link: https://lore.kernel.org/linux-btrfs/b19f4310-35e0-606e-1eea-2dd84d28c5da@synology.com/
Fixes:
|
|
|
|
ebd99a6b34 |
btrfs: fix build when using M=fs/btrfs
There are people building the module with M= that's supposed to be used
for external modules. This got broken in
|
|
|
|
3cb894972f |
btrfs: do not initialize dev replace for bad dev root
While helping Neal fix his broken file system I added a debug patch to catch if we were calling btrfs_search_slot with a NULL root, and this stack trace popped: we tried to search with a NULL root CPU: 0 PID: 1760 Comm: mount Not tainted 5.11.0-155.nealbtrfstest.1.fc34.x86_64 #1 Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 07/22/2020 Call Trace: dump_stack+0x6b/0x83 btrfs_search_slot.cold+0x11/0x1b ? btrfs_init_dev_replace+0x36/0x450 btrfs_init_dev_replace+0x71/0x450 open_ctree+0x1054/0x1610 btrfs_mount_root.cold+0x13/0xfa legacy_get_tree+0x27/0x40 vfs_get_tree+0x25/0xb0 vfs_kern_mount.part.0+0x71/0xb0 btrfs_mount+0x131/0x3d0 ? legacy_get_tree+0x27/0x40 ? btrfs_show_options+0x640/0x640 legacy_get_tree+0x27/0x40 vfs_get_tree+0x25/0xb0 path_mount+0x441/0xa80 __x64_sys_mount+0xf4/0x130 do_syscall_64+0x33/0x40 entry_SYSCALL_64_after_hwframe+0x44/0xa9 RIP: 0033:0x7f644730352e Fix this by not starting the device replace stuff if we do not have a NULL dev root. Reported-by: Neal Gompa <ngompa13@gmail.com> CC: stable@vger.kernel.org # 5.11+ Signed-off-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: David Sterba <dsterba@suse.com> |
|
|
|
820a49dafc |
btrfs: initialize device::fs_info always
Neal reported a panic trying to use -o rescue=all BUG: kernel NULL pointer dereference, address: 0000000000000030 PGD 0 P4D 0 Oops: 0000 [#1] SMP NOPTI CPU: 0 PID: 696 Comm: mount Tainted: G W 5.12.0-rc2+ #296 Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.13.0-2.fc32 04/01/2014 RIP: 0010:btrfs_device_init_dev_stats+0x1d/0x200 RSP: 0018:ffffafaec1483bb8 EFLAGS: 00010286 RAX: 0000000000000000 RBX: ffff9a5715bcb298 RCX: 0000000000000070 RDX: ffff9a5703248000 RSI: ffff9a57052ea150 RDI: ffff9a5715bca400 RBP: ffff9a57052ea150 R08: 0000000000000070 R09: ffff9a57052ea150 R10: 000130faf0741c10 R11: 0000000000000000 R12: ffff9a5703700000 R13: 0000000000000000 R14: ffff9a5715bcb278 R15: ffff9a57052ea150 FS: 00007f600d122c40(0000) GS:ffff9a577bc00000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 0000000000000030 CR3: 0000000112a46005 CR4: 0000000000370ef0 Call Trace: ? btrfs_init_dev_stats+0x1f/0xf0 ? kmem_cache_alloc+0xef/0x1f0 btrfs_init_dev_stats+0x5f/0xf0 open_ctree+0x10cb/0x1720 btrfs_mount_root.cold+0x12/0xea legacy_get_tree+0x27/0x40 vfs_get_tree+0x25/0xb0 vfs_kern_mount.part.0+0x71/0xb0 btrfs_mount+0x10d/0x380 legacy_get_tree+0x27/0x40 vfs_get_tree+0x25/0xb0 path_mount+0x433/0xa00 __x64_sys_mount+0xe3/0x120 do_syscall_64+0x33/0x40 entry_SYSCALL_64_after_hwframe+0x44/0xae This happens because when we call btrfs_init_dev_stats we do device->fs_info->dev_root. However device->fs_info isn't initialized because we were only calling btrfs_init_devices_late() if we properly read the device root. However we don't actually need the device root to init the devices, this function simply assigns the devices their ->fs_info pointer properly, so this needs to be done unconditionally always so that we can properly dereference device->fs_info in rescue cases. Reported-by: Neal Gompa <ngompa13@gmail.com> CC: stable@vger.kernel.org # 5.11+ Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
|
|
|
82d62d06db |
btrfs: do not initialize dev stats if we have no dev_root
Neal reported a panic trying to use -o rescue=all BUG: kernel NULL pointer dereference, address: 0000000000000030 PGD 0 P4D 0 Oops: 0000 [#1] SMP PTI CPU: 0 PID: 4095 Comm: mount Not tainted 5.11.0-0.rc7.149.fc34.x86_64 #1 RIP: 0010:btrfs_device_init_dev_stats+0x4c/0x1f0 RSP: 0018:ffffa60285fbfb68 EFLAGS: 00010246 RAX: 0000000000000000 RBX: ffff88b88f806498 RCX: ffff88b82e7a2a10 RDX: ffffa60285fbfb97 RSI: ffff88b82e7a2a10 RDI: 0000000000000000 RBP: ffff88b88f806b3c R08: 0000000000000000 R09: 0000000000000000 R10: ffff88b82e7a2a10 R11: 0000000000000000 R12: ffff88b88f806a00 R13: ffff88b88f806478 R14: ffff88b88f806a00 R15: ffff88b82e7a2a10 FS: 00007f698be1ec40(0000) GS:ffff88b937e00000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 0000000000000030 CR3: 0000000092c9c006 CR4: 00000000003706f0 Call Trace: ? btrfs_init_dev_stats+0x1f/0xf0 btrfs_init_dev_stats+0x62/0xf0 open_ctree+0x1019/0x15ff btrfs_mount_root.cold+0x13/0xfa legacy_get_tree+0x27/0x40 vfs_get_tree+0x25/0xb0 vfs_kern_mount.part.0+0x71/0xb0 btrfs_mount+0x131/0x3d0 ? legacy_get_tree+0x27/0x40 ? btrfs_show_options+0x640/0x640 legacy_get_tree+0x27/0x40 vfs_get_tree+0x25/0xb0 path_mount+0x441/0xa80 __x64_sys_mount+0xf4/0x130 do_syscall_64+0x33/0x40 entry_SYSCALL_64_after_hwframe+0x44/0xa9 RIP: 0033:0x7f698c04e52e This happens because we unconditionally attempt to initialize device stats on mount, but we may not have been able to read the device root. Fix this by skipping initializing the device stats if we do not have a device root. Reported-by: Neal Gompa <ngompa13@gmail.com> CC: stable@vger.kernel.org # 5.11+ Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
|
|
|
f3da882eae |
btrfs: zoned: remove outdated WARN_ON in direct IO
In btrfs_submit_direct() there's a WAN_ON_ONCE() that will trigger if we're submitting a DIO write on a zoned filesystem but are not using REQ_OP_ZONE_APPEND to submit the IO to the block device. This is a left over from a previous version where btrfs_dio_iomap_begin() didn't use btrfs_use_zone_append() to check for sequential write only zones. It is an oversight from the development phase. In v11 (I think) I've added |
|
|
|
485df75554 |
btrfs: always pin deleted leaves when there are active tree mod log users
When freeing a tree block we may end up adding its extent back to the free space cache/tree, as long as there are no more references for it, it was created in the current transaction and writeback for it never happened. This is generally fine, however when we have tree mod log operations it can result in inconsistent versions of a btree after unwinding extent buffers with the recorded tree mod log operations. This is because: * We only log operations for nodes (adding and removing key/pointers), for leaves we don't do anything; * This means that we can log a MOD_LOG_KEY_REMOVE_WHILE_FREEING operation for a node that points to a leaf that was deleted; * Before we apply the logged operation to unwind a node, we can have that leaf's extent allocated again, either as a node or as a leaf, and possibly for another btree. This is possible if the leaf was created in the current transaction and writeback for it never started, in which case btrfs_free_tree_block() returns its extent back to the free space cache/tree; * Then, before applying the tree mod log operation, some task allocates the metadata extent just freed before, and uses it either as a leaf or as a node for some btree (can be the same or another one, it does not matter); * After applying the MOD_LOG_KEY_REMOVE_WHILE_FREEING operation we now get the target node with an item pointing to the metadata extent that now has content different from what it had before the leaf was deleted. It might now belong to a different btree and be a node and not a leaf anymore. As a consequence, the results of searches after the unwinding can be unpredictable and produce unexpected results. So make sure we pin extent buffers corresponding to leaves when there are tree mod log users. CC: stable@vger.kernel.org # 4.14+ Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
|
|
|
dbcc7d57bf |
btrfs: fix race when cloning extent buffer during rewind of an old root
While resolving backreferences, as part of a logical ino ioctl call or
fiemap, we can end up hitting a BUG_ON() when replaying tree mod log
operations of a root, triggering a stack trace like the following:
------------[ cut here ]------------
kernel BUG at fs/btrfs/ctree.c:1210!
invalid opcode: 0000 [#1] SMP KASAN PTI
CPU: 1 PID: 19054 Comm: crawl_335 Tainted: G W 5.11.0-2d11c0084b02-misc-next+ #89
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.12.0-1 04/01/2014
RIP: 0010:__tree_mod_log_rewind+0x3b1/0x3c0
Code: 05 48 8d 74 10 (...)
RSP: 0018:ffffc90001eb70b8 EFLAGS: 00010297
RAX: 0000000000000000 RBX: ffff88812344e400 RCX: ffffffffb28933b6
RDX: 0000000000000007 RSI: dffffc0000000000 RDI: ffff88812344e42c
RBP: ffffc90001eb7108 R08: 1ffff11020b60a20 R09: ffffed1020b60a20
R10: ffff888105b050f9 R11: ffffed1020b60a1f R12: 00000000000000ee
R13: ffff8880195520c0 R14: ffff8881bc958500 R15: ffff88812344e42c
FS: 00007fd1955e8700(0000) GS:ffff8881f5600000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007efdb7928718 CR3: 000000010103a006 CR4: 0000000000170ee0
Call Trace:
btrfs_search_old_slot+0x265/0x10d0
? lock_acquired+0xbb/0x600
? btrfs_search_slot+0x1090/0x1090
? free_extent_buffer.part.61+0xd7/0x140
? free_extent_buffer+0x13/0x20
resolve_indirect_refs+0x3e9/0xfc0
? lock_downgrade+0x3d0/0x3d0
? __kasan_check_read+0x11/0x20
? add_prelim_ref.part.11+0x150/0x150
? lock_downgrade+0x3d0/0x3d0
? __kasan_check_read+0x11/0x20
? lock_acquired+0xbb/0x600
? __kasan_check_write+0x14/0x20
? do_raw_spin_unlock+0xa8/0x140
? rb_insert_color+0x30/0x360
? prelim_ref_insert+0x12d/0x430
find_parent_nodes+0x5c3/0x1830
? resolve_indirect_refs+0xfc0/0xfc0
? lock_release+0xc8/0x620
? fs_reclaim_acquire+0x67/0xf0
? lock_acquire+0xc7/0x510
? lock_downgrade+0x3d0/0x3d0
? lockdep_hardirqs_on_prepare+0x160/0x210
? lock_release+0xc8/0x620
? fs_reclaim_acquire+0x67/0xf0
? lock_acquire+0xc7/0x510
? poison_range+0x38/0x40
? unpoison_range+0x14/0x40
? trace_hardirqs_on+0x55/0x120
btrfs_find_all_roots_safe+0x142/0x1e0
? find_parent_nodes+0x1830/0x1830
? btrfs_inode_flags_to_xflags+0x50/0x50
iterate_extent_inodes+0x20e/0x580
? tree_backref_for_extent+0x230/0x230
? lock_downgrade+0x3d0/0x3d0
? read_extent_buffer+0xdd/0x110
? lock_downgrade+0x3d0/0x3d0
? __kasan_check_read+0x11/0x20
? lock_acquired+0xbb/0x600
? __kasan_check_write+0x14/0x20
? _raw_spin_unlock+0x22/0x30
? __kasan_check_write+0x14/0x20
iterate_inodes_from_logical+0x129/0x170
? iterate_inodes_from_logical+0x129/0x170
? btrfs_inode_flags_to_xflags+0x50/0x50
? iterate_extent_inodes+0x580/0x580
? __vmalloc_node+0x92/0xb0
? init_data_container+0x34/0xb0
? init_data_container+0x34/0xb0
? kvmalloc_node+0x60/0x80
btrfs_ioctl_logical_to_ino+0x158/0x230
btrfs_ioctl+0x205e/0x4040
? __might_sleep+0x71/0xe0
? btrfs_ioctl_get_supported_features+0x30/0x30
? getrusage+0x4b6/0x9c0
? __kasan_check_read+0x11/0x20
? lock_release+0xc8/0x620
? __might_fault+0x64/0xd0
? lock_acquire+0xc7/0x510
? lock_downgrade+0x3d0/0x3d0
? lockdep_hardirqs_on_prepare+0x210/0x210
? lockdep_hardirqs_on_prepare+0x210/0x210
? __kasan_check_read+0x11/0x20
? do_vfs_ioctl+0xfc/0x9d0
? ioctl_file_clone+0xe0/0xe0
? lock_downgrade+0x3d0/0x3d0
? lockdep_hardirqs_on_prepare+0x210/0x210
? __kasan_check_read+0x11/0x20
? lock_release+0xc8/0x620
? __task_pid_nr_ns+0xd3/0x250
? lock_acquire+0xc7/0x510
? __fget_files+0x160/0x230
? __fget_light+0xf2/0x110
__x64_sys_ioctl+0xc3/0x100
do_syscall_64+0x37/0x80
entry_SYSCALL_64_after_hwframe+0x44/0xa9
RIP: 0033:0x7fd1976e2427
Code: 00 00 90 48 8b 05 (...)
RSP: 002b:00007fd1955e5cf8 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
RAX: ffffffffffffffda RBX: 00007fd1955e5f40 RCX: 00007fd1976e2427
RDX: 00007fd1955e5f48 RSI: 00000000c038943b RDI: 0000000000000004
RBP: 0000000001000000 R08: 0000000000000000 R09: 00007fd1955e6120
R10: 0000557835366b00 R11: 0000000000000246 R12: 0000000000000004
R13: 00007fd1955e5f48 R14: 00007fd1955e5f40 R15: 00007fd1955e5ef8
Modules linked in:
---[ end trace ec8931a1c36e57be ]---
(gdb) l *(__tree_mod_log_rewind+0x3b1)
0xffffffff81893521 is in __tree_mod_log_rewind (fs/btrfs/ctree.c:1210).
1205 * the modification. as we're going backwards, we do the
1206 * opposite of each operation here.
1207 */
1208 switch (tm->op) {
1209 case MOD_LOG_KEY_REMOVE_WHILE_FREEING:
1210 BUG_ON(tm->slot < n);
1211 fallthrough;
1212 case MOD_LOG_KEY_REMOVE_WHILE_MOVING:
1213 case MOD_LOG_KEY_REMOVE:
1214 btrfs_set_node_key(eb, &tm->key, tm->slot);
Here's what happens to hit that BUG_ON():
1) We have one tree mod log user (through fiemap or the logical ino ioctl),
with a sequence number of 1, so we have fs_info->tree_mod_seq == 1;
2) Another task is at ctree.c:balance_level() and we have eb X currently as
the root of the tree, and we promote its single child, eb Y, as the new
root.
Then, at ctree.c:balance_level(), we call:
tree_mod_log_insert_root(eb X, eb Y, 1);
3) At tree_mod_log_insert_root() we create tree mod log elements for each
slot of eb X, of operation type MOD_LOG_KEY_REMOVE_WHILE_FREEING each
with a ->logical pointing to ebX->start. These are placed in an array
named tm_list.
Lets assume there are N elements (N pointers in eb X);
4) Then, still at tree_mod_log_insert_root(), we create a tree mod log
element of operation type MOD_LOG_ROOT_REPLACE, ->logical set to
ebY->start, ->old_root.logical set to ebX->start, ->old_root.level set
to the level of eb X and ->generation set to the generation of eb X;
5) Then tree_mod_log_insert_root() calls tree_mod_log_free_eb() with
tm_list as argument. After that, tree_mod_log_free_eb() calls
__tree_mod_log_insert() for each member of tm_list in reverse order,
from highest slot in eb X, slot N - 1, to slot 0 of eb X;
6) __tree_mod_log_insert() sets the sequence number of each given tree mod
log operation - it increments fs_info->tree_mod_seq and sets
fs_info->tree_mod_seq as the sequence number of the given tree mod log
operation.
This means that for the tm_list created at tree_mod_log_insert_root(),
the element corresponding to slot 0 of eb X has the highest sequence
number (1 + N), and the element corresponding to the last slot has the
lowest sequence number (2);
7) Then, after inserting tm_list's elements into the tree mod log rbtree,
the MOD_LOG_ROOT_REPLACE element is inserted, which gets the highest
sequence number, which is N + 2;
8) Back to ctree.c:balance_level(), we free eb X by calling
btrfs_free_tree_block() on it. Because eb X was created in the current
transaction, has no other references and writeback did not happen for
it, we add it back to the free space cache/tree;
9) Later some other task T allocates the metadata extent from eb X, since
it is marked as free space in the space cache/tree, and uses it as a
node for some other btree;
10) The tree mod log user task calls btrfs_search_old_slot(), which calls
get_old_root(), and finally that calls __tree_mod_log_oldest_root()
with time_seq == 1 and eb_root == eb Y;
11) First iteration of the while loop finds the tree mod log element with
sequence number N + 2, for the logical address of eb Y and of type
MOD_LOG_ROOT_REPLACE;
12) Because the operation type is MOD_LOG_ROOT_REPLACE, we don't break out
of the loop, and set root_logical to point to tm->old_root.logical
which corresponds to the logical address of eb X;
13) On the next iteration of the while loop, the call to
tree_mod_log_search_oldest() returns the smallest tree mod log element
for the logical address of eb X, which has a sequence number of 2, an
operation type of MOD_LOG_KEY_REMOVE_WHILE_FREEING and corresponds to
the old slot N - 1 of eb X (eb X had N items in it before being freed);
14) We then break out of the while loop and return the tree mod log operation
of type MOD_LOG_ROOT_REPLACE (eb Y), and not the one for slot N - 1 of
eb X, to get_old_root();
15) At get_old_root(), we process the MOD_LOG_ROOT_REPLACE operation
and set "logical" to the logical address of eb X, which was the old
root. We then call tree_mod_log_search() passing it the logical
address of eb X and time_seq == 1;
16) Then before calling tree_mod_log_search(), task T adds a key to eb X,
which results in adding a tree mod log operation of type
MOD_LOG_KEY_ADD to the tree mod log - this is done at
ctree.c:insert_ptr() - but after adding the tree mod log operation
and before updating the number of items in eb X from 0 to 1...
17) The task at get_old_root() calls tree_mod_log_search() and gets the
tree mod log operation of type MOD_LOG_KEY_ADD just added by task T.
Then it enters the following if branch:
if (old_root && tm && tm->op != MOD_LOG_KEY_REMOVE_WHILE_FREEING) {
(...)
} (...)
Calls read_tree_block() for eb X, which gets a reference on eb X but
does not lock it - task T has it locked.
Then it clones eb X while it has nritems set to 0 in its header, before
task T sets nritems to 1 in eb X's header. From hereupon we use the
clone of eb X which no other task has access to;
18) Then we call __tree_mod_log_rewind(), passing it the MOD_LOG_KEY_ADD
mod log operation we just got from tree_mod_log_search() in the
previous step and the cloned version of eb X;
19) At __tree_mod_log_rewind(), we set the local variable "n" to the number
of items set in eb X's clone, which is 0. Then we enter the while loop,
and in its first iteration we process the MOD_LOG_KEY_ADD operation,
which just decrements "n" from 0 to (u32)-1, since "n" is declared with
a type of u32. At the end of this iteration we call rb_next() to find the
next tree mod log operation for eb X, that gives us the mod log operation
of type MOD_LOG_KEY_REMOVE_WHILE_FREEING, for slot 0, with a sequence
number of N + 1 (steps 3 to 6);
20) Then we go back to the top of the while loop and trigger the following
BUG_ON():
(...)
switch (tm->op) {
case MOD_LOG_KEY_REMOVE_WHILE_FREEING:
BUG_ON(tm->slot < n);
fallthrough;
(...)
Because "n" has a value of (u32)-1 (4294967295) and tm->slot is 0.
Fix this by taking a read lock on the extent buffer before cloning it at
ctree.c:get_old_root(). This should be done regardless of the extent
buffer having been freed and reused, as a concurrent task might be
modifying it (while holding a write lock on it).
Reported-by: Zygo Blaxell <ce3g8jdj@umail.furryterror.org>
Link: https://lore.kernel.org/linux-btrfs/20210227155037.GN28049@hungrycats.org/
Fixes:
|
|
|
|
34e49994d0 |
btrfs: fix slab cache flags for free space tree bitmap
The free space tree bitmap slab cache is created with SLAB_RED_ZONE but
that's a debugging flag and not always enabled. Also the other slabs are
created with at least SLAB_MEM_SPREAD that we want as well to average
the memory placement cost.
Reported-by: Vlastimil Babka <vbabka@suse.cz>
Fixes:
|
|
|
|
60484cd9d5 |
btrfs: subpage: make readahead work properly
In readahead infrastructure, we are using a lot of hard coded PAGE_SHIFT while we're not doing anything specific to PAGE_SIZE. One of the most affected part is the radix tree operation of btrfs_fs_info::reada_tree. If using PAGE_SHIFT, subpage metadata readahead is broken and does no help reading metadata ahead. Fix the problem by using btrfs_fs_info::sectorsize_bits so that readahead could work for subpage. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
|
|
|
d9bb77d51e |
btrfs: subpage: fix wild pointer access during metadata read failure
[BUG] When running fstests for btrfs subpage read-write test, it has a very high chance to crash at generic/475 with the following stack: BTRFS warning (device dm-8): direct IO failed ino 510 rw 1,34817 sector 0xcdf0 len 94208 err no 10 Unable to handle kernel paging request at virtual address ffff80001157e7c0 CPU: 2 PID: 687125 Comm: kworker/u12:4 Tainted: G WC 5.12.0-rc2-custom+ #5 Hardware name: Khadas VIM3 (DT) Workqueue: btrfs-endio-meta btrfs_work_helper [btrfs] pc : queued_spin_lock_slowpath+0x1a0/0x390 lr : do_raw_spin_lock+0xc4/0x11c Call trace: queued_spin_lock_slowpath+0x1a0/0x390 _raw_spin_lock+0x68/0x84 btree_readahead_hook+0x38/0xc0 [btrfs] end_bio_extent_readpage+0x504/0x5f4 [btrfs] bio_endio+0x170/0x1a4 end_workqueue_fn+0x3c/0x60 [btrfs] btrfs_work_helper+0x1b0/0x1b4 [btrfs] process_one_work+0x22c/0x430 worker_thread+0x70/0x3a0 kthread+0x13c/0x140 ret_from_fork+0x10/0x30 Code: 910020e0 8b0200c2 f861d884 aa0203e1 (f8246827) [CAUSE] In end_bio_extent_readpage(), if we hit an error during read, we will handle the error differently for data and metadata. For data we queue a repair, while for metadata, we record the error and let the caller choose what to do. But the code is still using page->private to grab extent buffer, which no longer points to extent buffer for subpage metadata pages. Thus this wild pointer access leads to above crash. [FIX] Introduce a helper, find_extent_buffer_readpage(), to grab extent buffer. The difference against find_extent_buffer_nospinlock() is: - Also handles regular sectorsize == PAGE_SIZE case - No extent buffer refs increase/decrease As extent buffer under IO must have non-zero refs, so this is safe Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
|
|
|
e3d3b41576 |
btrfs: zoned: fix linked list corruption after log root tree allocation failure
When using a zoned filesystem, while syncing the log, if we fail to
allocate the root node for the log root tree, we are not removing the
log context we allocated on stack from the list of log contexts of the
log root tree. This means after the return from btrfs_sync_log() we get
a corrupted linked list.
Fix this by allocating the node before adding our stack allocated context
to the list of log contexts of the log root tree.
Fixes:
|
|
|
|
a3ee79bd8f |
btrfs: fix qgroup data rsv leak caused by falloc failure
[BUG]
When running fsstress with only falloc workload, and a very low qgroup
limit set, we can get qgroup data rsv leak at unmount time.
BTRFS warning (device dm-0): qgroup 0/5 has unreleased space, type 0 rsv 20480
BTRFS error (device dm-0): qgroup reserved space leaked
The minimal reproducer looks like:
#!/bin/bash
dev=/dev/test/test
mnt="/mnt/btrfs"
fsstress=~/xfstests-dev/ltp/fsstress
runtime=8
workload()
{
umount $dev &> /dev/null
umount $mnt &> /dev/null
mkfs.btrfs -f $dev > /dev/null
mount $dev $mnt
btrfs quota en $mnt
btrfs quota rescan -w $mnt
btrfs qgroup limit 16m 0/5 $mnt
$fsstress -w -z -f creat=10 -f fallocate=10 -p 2 -n 100 \
-d $mnt -v > /tmp/fsstress
umount $mnt
if dmesg | grep leak ; then
echo "!!! FAILED !!!"
exit 1
fi
}
for (( i=0; i < $runtime; i++)); do
echo "=== $i/$runtime==="
workload
done
Normally it would fail before round 4.
[CAUSE]
In function insert_prealloc_file_extent(), we first call
btrfs_qgroup_release_data() to know how many bytes are reserved for
qgroup data rsv.
Then use that @qgroup_released number to continue our work.
But after we call btrfs_qgroup_release_data(), we should either queue
@qgroup_released to delayed ref or free them manually in error path.
Unfortunately, we lack the error handling to free the released bytes,
leaking qgroup data rsv.
All the error handling function outside won't help at all, as we have
released the range, meaning in inode io tree, the EXTENT_QGROUP_RESERVED
bit is already cleared, thus all btrfs_qgroup_free_data() call won't
free any data rsv.
[FIX]
Add free_qgroup tag to manually free the released qgroup data rsv.
Reported-by: Nikolay Borisov <nborisov@suse.com>
Reported-by: David Sterba <dsterba@suse.cz>
Fixes:
|
|
|
|
fbf48bb0b1 |
btrfs: track qgroup released data in own variable in insert_prealloc_file_extent
There is a piece of weird code in insert_prealloc_file_extent(), which
looks like:
ret = btrfs_qgroup_release_data(inode, file_offset, len);
if (ret < 0)
return ERR_PTR(ret);
if (trans) {
ret = insert_reserved_file_extent(trans, inode,
file_offset, &stack_fi,
true, ret);
...
}
extent_info.is_new_extent = true;
extent_info.qgroup_reserved = ret;
...
Note how the variable @ret is abused here, and if anyone is adding code
just after btrfs_qgroup_release_data() call, it's super easy to
overwrite the @ret and cause tons of qgroup related bugs.
Fix such abuse by introducing new variable @qgroup_released, so that we
won't reuse the existing variable @ret.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
|
|
d2dcc8ed8e |
btrfs: fix wrong offset to zero out range beyond i_size
[BUG] The test generic/091 fails , with the following output: fsx -N 10000 -o 128000 -l 500000 -r PSIZE -t BSIZE -w BSIZE -Z -W mapped writes DISABLED Seed set to 1 main: filesystem does not support fallocate mode FALLOC_FL_COLLAPSE_RANGE, disabling! main: filesystem does not support fallocate mode FALLOC_FL_INSERT_RANGE, disabling! skipping zero size read truncating to largest ever: 0xe400 copying to largest ever: 0x1f400 cloning to largest ever: 0x70000 cloning to largest ever: 0x77000 fallocating to largest ever: 0x7a120 Mapped Read: non-zero data past EOF (0x3a7ff) page offset 0x800 is 0xf2e1 <<< ... [CAUSE] In commit |
|
|
|
ce307084c9 |
block-5.12-2021-03-12-v2
-----BEGIN PGP SIGNATURE-----
iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmBLzKsQHGF4Ym9lQGtl
cm5lbC5kawAKCRD301j7KXHgpi0ID/9djN1db0OrAjQgWdOQsKwzcPG4fmVRHJAu
Zi8SPRj0ByonWGaPWjiSi297/j00dfYFFIXaB1Pfo4j0wX0IK8bJINl0G8SN6Dag
WYBBrT/5rCQgD8fjQ1XhuzuqLwxwcZfYXAnCAlqABG18nPk532D4dX2CMEasl8F7
XWTTj5PqHDN4bCcriH1GEA5S+2nmoz5YXjNZEDcY3/pQMdyb8Jo9mRfZubkrnRxK
c9fz2LjUz0IRaSb+9PILY5qDLOSIh+vHOIk/3BKW9DoqU/S3kTTr4twqnOclfVPH
VgJM9b+sHveVCztCJ9bnNGkW7HWjUQa8gb/B40NBxKEhw7w/HCjykhhxd+QTUQTM
GJVMRGYWhzuUEuU1M1hArPua0GLmPKSvC0CRgbKRmgPNjshTquZPJnBBFwv2wZKQ
GkrwktdK9ihE1ya4gu20MupST3PIpT3jtc6NAizr6DCy0wJ0Z1X5KYnFdbtS79No
I9qPC8lu3AcZq6NXdBfTO9ngIdiUwi9AfSYj7koS/4dmnVccVJmaj0/NNmVp2Ro3
HtaObanBnTi9v8YHl8WgX6lq5RjuQ204fXmd0No4mHFvgxsl7YaX+JBts7S3A2Nf
PoQLqmulcLmzT3EVuEg279aXw2rbnyWHARbF/5/tIr4JcugtLJhwFnBA5YgFreq9
lSbqgoKSHw==
=qHyO
-----END PGP SIGNATURE-----
Merge tag 'block-5.12-2021-03-12-v2' of git://git.kernel.dk/linux-block
Pull block fixes from Jens Axboe:
"Mostly just random fixes all over the map.
The only odd-one-out change is finally getting the rename of
BIO_MAX_PAGES to BIO_MAX_VECS done. This should've been done with the
multipage bvec change, but it's been left.
Do it now to avoid hassles around changes piling up for the next merge
window.
Summary:
- NVMe pull request:
- one more quirk (Dmitry Monakhov)
- fix max_zone_append_sectors initialization (Chaitanya Kulkarni)
- nvme-fc reset/create race fix (James Smart)
- fix status code on aborts/resets (Hannes Reinecke)
- fix the CSS check for ZNS namespaces (Chaitanya Kulkarni)
- fix a use after free in a debug printk in nvme-rdma (Lv Yunlong)
- Follow-up NVMe error fix for NULL 'id' (Christoph)
- Fixup for the bd_size_lock being IRQ safe, now that the offending
driver has been dropped (Damien).
- rsxx probe failure error return (Jia-Ju)
- umem probe failure error return (Wei)
- s390/dasd unbind fixes (Stefan)
- blk-cgroup stats summing fix (Xunlei)
- zone reset handling fix (Damien)
- Rename BIO_MAX_PAGES to BIO_MAX_VECS (Christoph)
- Suppress uevent trigger for hidden devices (Daniel)
- Fix handling of discard on busy device (Jan)
- Fix stale cache issue with zone reset (Shin'ichiro)"
* tag 'block-5.12-2021-03-12-v2' of git://git.kernel.dk/linux-block:
nvme: fix the nsid value to print in nvme_validate_or_alloc_ns
block: Discard page cache of zone reset target range
block: Suppress uevent for hidden device when removed
block: rename BIO_MAX_PAGES to BIO_MAX_VECS
nvme-pci: add the DISABLE_WRITE_ZEROES quirk for a Samsung PM1725a
nvme-rdma: Fix a use after free in nvmet_rdma_write_data_done
nvme-core: check ctrl css before setting up zns
nvme-fc: fix racing controller reset and create association
nvme-fc: return NVME_SC_HOST_ABORTED_CMD when a command has been aborted
nvme-fc: set NVME_REQ_CANCELLED in nvme_fc_terminate_exchange()
nvme: add NVME_REQ_CANCELLED flag in nvme_cancel_request()
nvme: simplify error logic in nvme_validate_ns()
nvme: set max_zone_append_sectors nvme_revalidate_zones
block: rsxx: fix error return code of rsxx_pci_probe()
block: Fix REQ_OP_ZONE_RESET_ALL handling
umem: fix error return code in mm_pci_probe()
blk-cgroup: Fix the recursive blkg rwstat
s390/dasd: fix hanging IO request during DASD driver unbind
s390/dasd: fix hanging DASD driver unbind
block: Try to handle busy underlying device on discard
|
|
|
|
a8affc03a9 |
block: rename BIO_MAX_PAGES to BIO_MAX_VECS
Ever since the addition of multipage bio_vecs BIO_MAX_PAGES has been horribly confusingly misnamed. Rename it to BIO_MAX_VECS to stop confusing users of the bio API. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Link: https://lore.kernel.org/r/20210311110137.1132391-2-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk> |
|
|
|
f09b04cc64 |
for-5.12-rc1-tag
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEE8rQSAMVO+zA4DBdWxWXV+ddtWDsFAmBCOi4ACgkQxWXV+ddt
WDtXvw//TWx3m05qHJqqG8V90uel8hB2J5vd4CA2r62Je1G8RDho57Bo7fyvL4l+
mdCPt+INajb0mpp0IoHMtyLHefojgNOsrX6FAK1/gjnLkjRLFZ3wQqkA34Ue9pNs
2u+rMY6eB105iaS3VejEmiebr++MZfjfQRV+GXU336AEeOEDZdgol8o6jMyde5TO
zRH9Dni5Sy/YAGGAb0vaoG2BMyVigrqkbjkzwjYChbUj/KuyffAgQj0v8BvsC9Y6
DnPD5yrt5kSZzuqQFH7c2jxLN0cvW+tJ0znCpnwn/nmiCALbl6y2a4dmewC32TwJ
II+3OPGpYudafLJEP15qafsJb7LmEfnGwUIrfEZbyb4lQG12uyYOdP3IN7+8td14
fd29GE62w5aErsmurcMFj/x43k4DIfcqC8b+Y+S27JZF1szh7ExCfoYC/6c5e5Qf
j6/6RtRSVqdxImRd0QYv3mCIeSG0CH2UR/1otvC81jRTHRyB3r6TV8wPLo+5K/Rk
ongKZ+BQa5RUk8skdFburhrkDDKgfBcjlexl5Gsqw+D/xTGNAcVnNQrTtW9sTSle
hB3b7CunXA1eCyui2SIqN1dR8hwao4b9RzYNs3y2jWjSPZD/Bp0BdQ8oxSPvIWkX
a8kauFGhKhY2Tdqau+CQ4UbbQWzEB7FulkPCOLiHDDZjyxIvAA4=
=tlU3
-----END PGP SIGNATURE-----
Merge tag 'for-5.12-rc1-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux
Pull btrfs fixes from David Sterba:
"More regression fixes and stabilization.
Regressions:
- zoned mode
- count zone sizes in wider int types
- fix space accounting for read-only block groups
- subpage: fix page tail zeroing
Fixes:
- fix spurious warning when remounting with free space tree
- fix warning when creating a directory with smack enabled
- ioctl checks for qgroup inheritance when creating a snapshot
- qgroup
- fix missing unlock on error path in zero range
- fix amount of released reservation on error
- fix flushing from unsafe context with open transaction,
potentially deadlocking
- minor build warning fixes"
* tag 'for-5.12-rc1-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
btrfs: zoned: do not account freed region of read-only block group as zone_unusable
btrfs: zoned: use sector_t for zone sectors
btrfs: subpage: fix the false data csum mismatch error
btrfs: fix warning when creating a directory with smack enabled
btrfs: don't flush from btrfs_delayed_inode_reserve_metadata
btrfs: export and rename qgroup_reserve_meta
btrfs: free correct amount of space in btrfs_delayed_inode_reserve_metadata
btrfs: fix spurious free_space_tree remount warning
btrfs: validate qgroup inherit for SNAP_CREATE_V2 ioctl
btrfs: unlock extents in btrfs_zero_range in case of quota reservation errors
btrfs: ref-verify: use 'inline void' keyword ordering
|
|
|
|
badae9c869 |
btrfs: zoned: do not account freed region of read-only block group as zone_unusable
We migrate zone unusable bytes to read-only bytes when a block group is
set to read-only, and account all the free region as bytes_readonly.
Thus, we should not increase block_group->zone_unusable when the block
group is read-only.
Fixes:
|
|
|
|
d734492a14 |
btrfs: zoned: use sector_t for zone sectors
We need to use sector_t for zone_sectors, or it would set the zone size
to zero when the size >= 4GB (= 2^24 sectors) by shifting the
zone_sectors value by SECTOR_SHIFT. We're assuming zones sizes up to
8GiB.
Fixes:
|
|
|
|
c28ea613fa |
btrfs: subpage: fix the false data csum mismatch error
[BUG]
When running fstresss, we can hit strange data csum mismatch where the
on-disk data is in fact correct (passes scrub).
With some extra debug info added, we have the following traces:
0482us: btrfs_do_readpage: root=5 ino=284 offset=393216, submit force=0 pgoff=0 iosize=8192
0494us: btrfs_do_readpage: root=5 ino=284 offset=401408, submit force=0 pgoff=8192 iosize=4096
0498us: btrfs_submit_data_bio: root=5 ino=284 bio first bvec=393216 len=8192
0591us: btrfs_do_readpage: root=5 ino=284 offset=405504, submit force=0 pgoff=12288 iosize=36864
0594us: btrfs_submit_data_bio: root=5 ino=284 bio first bvec=401408 len=4096
0863us: btrfs_submit_data_bio: root=5 ino=284 bio first bvec=405504 len=36864
0933us: btrfs_verify_data_csum: root=5 ino=284 offset=393216 len=8192
0967us: btrfs_do_readpage: root=5 ino=284 offset=442368, skip beyond isize pgoff=49152 iosize=16384
1047us: btrfs_verify_data_csum: root=5 ino=284 offset=401408 len=4096
1163us: btrfs_verify_data_csum: root=5 ino=284 offset=405504 len=36864
1290us: check_data_csum: !!! root=5 ino=284 offset=438272 pg_off=45056 !!!
7387us: end_bio_extent_readpage: root=5 ino=284 before pending_read_bios=0
[CAUSE]
Normally we expect all submitted bio reads to only touch the range we
specified, and under subpage context, it means we should only touch the
range specified in each bvec.
But in data read path, inside end_bio_extent_readpage(), we have page
zeroing which only takes regular page size into consideration.
This means for subpage if we have an inode whose content looks like below:
0 16K 32K 48K 64K
|///////| |///////| |
|//| = data needs to be read from disk
| | = hole
And i_size is 64K initially.
Then the following race can happen:
T1 | T2
--------------------------------+--------------------------------
btrfs_do_readpage() |
|- isize = 64K; |
| At this time, the isize is |
| 64K |
| |
|- submit_extent_page() |
| submit previous assembled bio|
| assemble bio for [0, 16K) |
| |
|- submit_extent_page() |
submit read bio for [0, 16K) |
assemble read bio for |
[32K, 48K) |
|
| btrfs_setsize()
| |- i_size_write(, 16K);
| Now i_size is only 16K
end_io() for [0K, 16K) |
|- end_bio_extent_readpage() |
|- btrfs_verify_data_csum() |
| No csum error |
|- i_size = 16K; |
|- zero_user_segment(16K, |
PAGE_SIZE); |
!!! We zeroed range |
!!! [32K, 48K) |
| end_io for [32K, 48K)
| |- end_bio_extent_readpage()
| |- btrfs_verify_data_csum()
| ! CSUM MISMATCH !
| ! As the range is zeroed now !
[FIX]
To fix the problem, make end_bio_extent_readpage() to only zero the
range of bvec.
The bug only affects subpage read-write support, as for full read-only
mount we can't change i_size thus won't hit the race condition.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
|
|
fd57a98d6f |
btrfs: fix warning when creating a directory with smack enabled
When we have smack enabled, during the creation of a directory smack may attempt to add a "smack transmute" xattr on the inode, which results in the following warning and trace: WARNING: CPU: 3 PID: 2548 at fs/btrfs/transaction.c:537 start_transaction+0x489/0x4f0 Modules linked in: nft_objref nf_conntrack_netbios_ns (...) CPU: 3 PID: 2548 Comm: mkdir Not tainted 5.9.0-rc2smack+ #81 Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.13.0-2.fc32 04/01/2014 RIP: 0010:start_transaction+0x489/0x4f0 Code: e9 be fc ff ff (...) RSP: 0018:ffffc90001887d10 EFLAGS: 00010202 RAX: ffff88816f1e0000 RBX: 0000000000000201 RCX: 0000000000000003 RDX: 0000000000000201 RSI: 0000000000000002 RDI: ffff888177849000 RBP: ffff888177849000 R08: 0000000000000001 R09: 0000000000000004 R10: ffffffff825e8f7a R11: 0000000000000003 R12: ffffffffffffffe2 R13: 0000000000000000 R14: ffff88803d884270 R15: ffff8881680d8000 FS: 00007f67317b8440(0000) GS:ffff88817bcc0000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 00007f67247a22a8 CR3: 000000004bfbc002 CR4: 0000000000370ee0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 Call Trace: ? slab_free_freelist_hook+0xea/0x1b0 ? trace_hardirqs_on+0x1c/0xe0 btrfs_setxattr_trans+0x3c/0xf0 __vfs_setxattr+0x63/0x80 smack_d_instantiate+0x2d3/0x360 security_d_instantiate+0x29/0x40 d_instantiate_new+0x38/0x90 btrfs_mkdir+0x1cf/0x1e0 vfs_mkdir+0x14f/0x200 do_mkdirat+0x6d/0x110 do_syscall_64+0x2d/0x40 entry_SYSCALL_64_after_hwframe+0x44/0xa9 RIP: 0033:0x7f673196ae6b Code: 8b 05 11 (...) RSP: 002b:00007ffc3c679b18 EFLAGS: 00000246 ORIG_RAX: 0000000000000053 RAX: ffffffffffffffda RBX: 00000000000001ff RCX: 00007f673196ae6b RDX: 0000000000000000 RSI: 00000000000001ff RDI: 00007ffc3c67a30d RBP: 00007ffc3c67a30d R08: 00000000000001ff R09: 0000000000000000 R10: 000055d3e39fe930 R11: 0000000000000246 R12: 0000000000000000 R13: 00007ffc3c679cd8 R14: 00007ffc3c67a30d R15: 00007ffc3c679ce0 irq event stamp: 11029 hardirqs last enabled at (11037): [<ffffffff81153fe6>] console_unlock+0x486/0x670 hardirqs last disabled at (11044): [<ffffffff81153c01>] console_unlock+0xa1/0x670 softirqs last enabled at (8864): [<ffffffff81e0102f>] asm_call_on_stack+0xf/0x20 softirqs last disabled at (8851): [<ffffffff81e0102f>] asm_call_on_stack+0xf/0x20 This happens because at btrfs_mkdir() we call d_instantiate_new() while holding a transaction handle, which results in the following call chain: btrfs_mkdir() trans = btrfs_start_transaction(root, 5); d_instantiate_new() smack_d_instantiate() __vfs_setxattr() btrfs_setxattr_trans() btrfs_start_transaction() start_transaction() WARN_ON() --> a tansaction start has TRANS_EXTWRITERS set in its type h->orig_rsv = h->block_rsv h->block_rsv = NULL btrfs_end_transaction(trans) Besides the warning triggered at start_transaction, we set the handle's block_rsv to NULL which may cause some surprises later on. So fix this by making btrfs_setxattr_trans() not start a transaction when we already have a handle on one, stored in current->journal_info, and use that handle. We are good to use the handle because at btrfs_mkdir() we did reserve space for the xattr and the inode item. Reported-by: Casey Schaufler <casey@schaufler-ca.com> CC: stable@vger.kernel.org # 5.4+ Acked-by: Casey Schaufler <casey@schaufler-ca.com> Tested-by: Casey Schaufler <casey@schaufler-ca.com> Link: https://lore.kernel.org/linux-btrfs/434d856f-bd7b-4889-a6ec-e81aaebfa735@schaufler-ca.com/ Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
|
|
|
4d14c5cde5 |
btrfs: don't flush from btrfs_delayed_inode_reserve_metadata
Calling btrfs_qgroup_reserve_meta_prealloc from btrfs_delayed_inode_reserve_metadata can result in flushing delalloc while holding a transaction and delayed node locks. This is deadlock prone. In the past multiple commits: * |
|
|
|
80e9baed72 |
btrfs: export and rename qgroup_reserve_meta
Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
|
|
|
0f9c03d824 |
btrfs: free correct amount of space in btrfs_delayed_inode_reserve_metadata
Following commit |
|
|
|
c55a4319c4 |
btrfs: fix spurious free_space_tree remount warning
The intended logic of the check is to catch cases where the desired free_space_tree setting doesn't match the mounted setting, and the remount is anything but ro->rw. However, it makes the mistake of checking equality on a masked integer (btrfs_test_opt) against a boolean (btrfs_fs_compat_ro). If you run the reproducer: $ mount -o space_cache=v2 dev mnt $ mount -o remount,ro mnt you would expect no warning, because the remount is not attempting to change the free space tree setting, but we do see the warning. To fix this, add explicit bool type casts to the condition. I tested a variety of transitions: sudo mount -o space_cache=v2 /dev/vg0/lv0 mnt/lol (fst enabled) mount -o remount,ro mnt/lol (no warning, no fst change) sudo mount -o remount,rw,space_cache=v1,clear_cache (no warning, ro->rw) sudo mount -o remount,rw,space_cache=v2 mnt (warning, rw->rw with change) sudo mount -o remount,ro mnt (no warning, no fst change) sudo mount -o remount,rw,space_cache=v2 mnt (no warning, no fst change) Reported-by: Chris Murphy <lists@colorremedies.com> CC: stable@vger.kernel.org # 5.11 Signed-off-by: Boris Burkov <boris@bur.io> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
|
|
|
5011c5a663 |
btrfs: validate qgroup inherit for SNAP_CREATE_V2 ioctl
The problem is we're copying "inherit" from user space but we don't
necessarily know that we're copying enough data for a 64 byte
struct. Then the next problem is that 'inherit' has a variable size
array at the end, and we have to verify that array is the size we
expected.
Fixes:
|
|
|
|
4f6a49de64 |
btrfs: unlock extents in btrfs_zero_range in case of quota reservation errors
If btrfs_qgroup_reserve_data returns an error (i.e quota limit reached)
the handling logic directly goes to the 'out' label without first
unlocking the extent range between lockstart, lockend. This results in
deadlocks as other processes try to lock the same extent.
Fixes:
|
|
|
|
aedb9d9089 |
btrfs: ref-verify: use 'inline void' keyword ordering
Fix build warnings of function signature when CONFIG_STACKTRACE is not enabled by reordering the 'inline' and 'void' keywords. ../fs/btrfs/ref-verify.c:221:1: warning: ‘inline’ is not at beginning of declaration [-Wold-style-declaration] static void inline __save_stack_trace(struct ref_action *ra) ../fs/btrfs/ref-verify.c:225:1: warning: ‘inline’ is not at beginning of declaration [-Wold-style-declaration] static void inline __print_stack_trace(struct btrfs_fs_info *fs_info, Signed-off-by: Randy Dunlap <rdunlap@infradead.org> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
|
|
|
7a7fd0de4a |
Merge branch 'kmap-conversion-for-5.12' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux
Pull kmap conversion updates from David Sterba:
"This contains changes regarding kmap API use and eg conversion from
kmap_atomic to kmap_local_page.
The API belongs to memory management but to save cross-tree
dependency headaches we've agreed to take it through the btrfs tree
because there are some trivial conversions possible, while the rest
will need some time and getting the easy cases out of the way would be
convenient.
The changes can be grouped:
- function exports, new helpers
- new VM_BUG_ON for additional verification; it's been discussed if
it should be VM_BUG_ON or BUG_ON, the former was chosen due to
performance reasons
- code replaced by relevant helpers"
[ This is an updated version of a request that originally came in during
the merge window, but I asked for some updates:
https://lore.kernel.org/lkml/cover.1614090658.git.dsterba@suse.com/
which is why this got merge after the merge window closed. - Linus ]
* 'kmap-conversion-for-5.12' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
btrfs: use copy_highpage() instead of 2 kmaps()
btrfs: use memcpy_[to|from]_page() and kmap_local_page()
mm/highmem: Add VM_BUG_ON() to mem*_page() calls
mm/highmem: Introduce memcpy_page(), memmove_page(), and memset_page()
mm/highmem: Convert memcpy_[to|from]_page() to kmap_local_page()
mm/highmem: Lift memcpy_[to|from]_page to core
|
|
|
|
c608aca57d |
for-5.12-rc1-tag
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEE8rQSAMVO+zA4DBdWxWXV+ddtWDsFAmA85UwACgkQxWXV+ddt
WDsdeA/8DXM6pMGaLkYcvkGvR53/vWwQlKq+i+3zuc41fYFJ7k+DQ7/K5hDbEMoM
E7YsksoRlNVruH/ZvSdtx1exQ/tNrTdqPuds/UR31lIvS2NX9OZZToGWoC8VmrNw
eS9yAwz/7JKUBA6MlMxZFv89OJoHUX9brPSeZVA8hOo3jDr5LXVm0IBskYOBUDRx
JIvt+lkJLKMXPWxwUt3hbkbFPAUQVxYYavhJhWiXT9gdxF+eRgjMI0EN43vBMN2y
kZtoZGeWR64heo9ehFzYMDlAVyph/loGovQ7m6XVzkk5DQGitg0vs3iAG46WjEXt
jxt0ZKmJQwJb3/zNPd8VlLMhULGc56jcq8uhaC2pXjhy18p7EAXml+fH51BExLYK
11hiWtWsrbTsZuYgr6fpqVFukkL/yyH/s7iCWT8Wn+AoPg2fUD99F5nkKT2T0Sso
t7MyJVlTdq8avWbTB+8kFx8+Hy1TsRz3Ic2Zpm8+F3KeVflrb31jJIp3cxPCdfUp
fWX+7VDjKVt00Ti7uP0fAaFO4hn2FjYcWzR3KOjomWox+8LVxB8PbD4H8jD7As2a
5gGGOULmkiZej7hcP6J6zvnmgZIVAGPsSGSVfZtPh4VGiycL3DozcD0x5QerLchR
NZDyIBh2KGE0cRr+cjkPxDyeqfGXQ7VUjp13CBriCkER8SOmBdw=
=QJEy
-----END PGP SIGNATURE-----
Merge tag 'for-5.12-rc1-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux
Pull btrfs fixes from David Sterba:
"This is the first batch of fixes that usually arrive during the merge
window code freeze. Regressions and stable material.
Regressions:
- fix deadlock in log sync in zoned mode
- fix bugs in subpage mode still wrongly assuming sectorsize == page
size
Fixes:
- fix missing kunmap of the Q stripe in RAID6
- block group fixes:
- fix race between extent freeing/allocation when using bitmaps
- avoid double put of block group when emptying cluster
- swapfile fixes:
- fix swapfile writes vs running scrub
- fix swapfile activation vs snapshot creation
- fix stale data exposure after cloning a hole with NO_HOLES enabled
- remove tree-checker check that does not work in case information
from other leaves is necessary"
* tag 'for-5.12-rc1-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
btrfs: zoned: fix deadlock on log sync
btrfs: avoid double put of block group when emptying cluster
btrfs: fix stale data exposure after cloning a hole with NO_HOLES enabled
btrfs: tree-checker: do not error out if extent ref hash doesn't match
btrfs: fix race between swap file activation and snapshot creation
btrfs: fix race between writes to swap files and scrub
btrfs: avoid checking for RO block group twice during nocow writeback
btrfs: fix race between extent freeing/allocation when using bitmaps
btrfs: make check_compressed_csum() to be subpage compatible
btrfs: make btrfs_submit_compressed_read() subpage compatible
btrfs: fix raid6 qstripe kmap
|
|
|
|
80cc838423 |
btrfs: use copy_highpage() instead of 2 kmaps()
There are many places where kmap/memove/kunmap patterns occur. This pattern exists in the core common function copy_highpage(). Use copy_highpage to avoid open coding the use of kmap and leverages the core functions use of kmap_local_page(). Development of this patch was aided by the following coccinelle script: // <smpl> // SPDX-License-Identifier: GPL-2.0-only // Find kmap/copypage/kunmap pattern and replace with copy_highpage calls // // NOTE: The expressions in the copy page version of this kmap pattern are // overly complex and so these all need individual attention. // // Confidence: Low // Copyright: (C) 2021 Intel Corporation // URL: http://coccinelle.lip6.fr/ // Comments: // Options: // // Then a copy_page where we have 2 pages involved. // @ copy_page_rule @ expression page, page2, To, From, Size; identifier ptr, ptr2; type VP, VP2; @@ /* kmap */ ( -VP ptr = kmap(page); ... -VP2 ptr2 = kmap(page2); | -VP ptr = kmap_atomic(page); ... -VP2 ptr2 = kmap_atomic(page2); | -ptr = kmap(page); ... -ptr2 = kmap(page2); | -ptr = kmap_atomic(page); ... -ptr2 = kmap_atomic(page2); ) // 1 or more copy versions of the entire page <+... ( -copy_page(To, From); +copy_highpage(To, From); | -memmove(To, From, Size); +memmoveExtra(To, From, Size); ) ...+> /* kunmap */ ( -kunmap(page2); ... -kunmap(page); | -kunmap(page); ... -kunmap(page2); | -kmap_atomic(ptr2); ... -kmap_atomic(ptr); ) // Remove any pointers left unused @ depends on copy_page_rule @ identifier copy_page_rule.ptr; identifier copy_page_rule.ptr2; type VP, VP1; type VP2, VP21; @@ -VP ptr; ... when != ptr; ? VP1 ptr; -VP2 ptr2; ... when != ptr2; ? VP21 ptr2; // </smpl> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Ira Weiny <ira.weiny@intel.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
|
|
|
3590ec5899 |
btrfs: use memcpy_[to|from]_page() and kmap_local_page()
There are many places where the pattern kmap/memcpy/kunmap occurs. This pattern was lifted to the core common functions memcpy_[to|from]_page(). Use these new functions to reduce the code, eliminate direct uses of kmap, and leverage the new core functions use of kmap_local_page(). Also, there is 1 place where a kmap/memcpy is followed by an optional memset. Here we leave the kmap open coded to avoid remapping the page but use kmap_local_page() directly. Development of this patch was aided by the coccinelle script: // <smpl> // SPDX-License-Identifier: GPL-2.0-only // Find kmap/memcpy/kunmap pattern and replace with memcpy*page calls // // NOTE: Offsets and other expressions may be more complex than what the script // will automatically generate. Therefore a catchall rule is provided to find // the pattern which then must be evaluated by hand. // // Confidence: Low // Copyright: (C) 2021 Intel Corporation // URL: http://coccinelle.lip6.fr/ // Comments: // Options: // // simple memcpy version // @ memcpy_rule1 @ expression page, T, F, B, Off; identifier ptr; type VP; @@ ( -VP ptr = kmap(page); | -ptr = kmap(page); | -VP ptr = kmap_atomic(page); | -ptr = kmap_atomic(page); ) <+... ( -memcpy(ptr + Off, F, B); +memcpy_to_page(page, Off, F, B); | -memcpy(ptr, F, B); +memcpy_to_page(page, 0, F, B); | -memcpy(T, ptr + Off, B); +memcpy_from_page(T, page, Off, B); | -memcpy(T, ptr, B); +memcpy_from_page(T, page, 0, B); ) ...+> ( -kunmap(page); | -kunmap_atomic(ptr); ) // Remove any pointers left unused @ depends on memcpy_rule1 @ identifier memcpy_rule1.ptr; type VP, VP1; @@ -VP ptr; ... when != ptr; ? VP1 ptr; // // Some callers kmap without a temp pointer // @ memcpy_rule2 @ expression page, T, Off, F, B; @@ <+... ( -memcpy(kmap(page) + Off, F, B); +memcpy_to_page(page, Off, F, B); | -memcpy(kmap(page), F, B); +memcpy_to_page(page, 0, F, B); | -memcpy(T, kmap(page) + Off, B); +memcpy_from_page(T, page, Off, B); | -memcpy(T, kmap(page), B); +memcpy_from_page(T, page, 0, B); ) ...+> -kunmap(page); // No need for the ptr variable removal // // Catch all // @ memcpy_rule3 @ expression page; expression GenTo, GenFrom, GenSize; identifier ptr; type VP; @@ ( -VP ptr = kmap(page); | -ptr = kmap(page); | -VP ptr = kmap_atomic(page); | -ptr = kmap_atomic(page); ) <+... ( // // Some call sites have complex expressions within the memcpy // match a catch all to be evaluated by hand. // -memcpy(GenTo, GenFrom, GenSize); +memcpy_to_pageExtra(page, GenTo, GenFrom, GenSize); +memcpy_from_pageExtra(GenTo, page, GenFrom, GenSize); ) ...+> ( -kunmap(page); | -kunmap_atomic(ptr); ) // Remove any pointers left unused @ depends on memcpy_rule3 @ identifier memcpy_rule3.ptr; type VP, VP1; @@ -VP ptr; ... when != ptr; ? VP1 ptr; // <smpl> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Ira Weiny <ira.weiny@intel.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
|
|
|
87fa0f3eb2 |
mm/filemap: rename generic_file_buffered_read to filemap_read
Rename generic_file_buffered_read to match the naming of filemap_fault, also update the written parameter to a more descriptive name and improve the kerneldoc comment. Link: https://lkml.kernel.org/r/20210122160140.223228-18-willy@infradead.org Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: Kent Overstreet <kent.overstreet@gmail.com> Cc: Miaohe Lin <linmiaohe@huawei.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> |
|
|
|
7d6beb71da |
idmapped-mounts-v5.12
-----BEGIN PGP SIGNATURE-----
iHUEABYKAB0WIQRAhzRXHqcMeLMyaSiRxhvAZXjcogUCYCegywAKCRCRxhvAZXjc
ouJ6AQDlf+7jCQlQdeKKoN9QDFfMzG1ooemat36EpRRTONaGuAD8D9A4sUsG4+5f
4IU5Lj9oY4DEmF8HenbWK2ZHsesL2Qg=
=yPaw
-----END PGP SIGNATURE-----
Merge tag 'idmapped-mounts-v5.12' of git://git.kernel.org/pub/scm/linux/kernel/git/brauner/linux
Pull idmapped mounts from Christian Brauner:
"This introduces idmapped mounts which has been in the making for some
time. Simply put, different mounts can expose the same file or
directory with different ownership. This initial implementation comes
with ports for fat, ext4 and with Christoph's port for xfs with more
filesystems being actively worked on by independent people and
maintainers.
Idmapping mounts handle a wide range of long standing use-cases. Here
are just a few:
- Idmapped mounts make it possible to easily share files between
multiple users or multiple machines especially in complex
scenarios. For example, idmapped mounts will be used in the
implementation of portable home directories in
systemd-homed.service(8) where they allow users to move their home
directory to an external storage device and use it on multiple
computers where they are assigned different uids and gids. This
effectively makes it possible to assign random uids and gids at
login time.
- It is possible to share files from the host with unprivileged
containers without having to change ownership permanently through
chown(2).
- It is possible to idmap a container's rootfs and without having to
mangle every file. For example, Chromebooks use it to share the
user's Download folder with their unprivileged containers in their
Linux subsystem.
- It is possible to share files between containers with
non-overlapping idmappings.
- Filesystem that lack a proper concept of ownership such as fat can
use idmapped mounts to implement discretionary access (DAC)
permission checking.
- They allow users to efficiently changing ownership on a per-mount
basis without having to (recursively) chown(2) all files. In
contrast to chown (2) changing ownership of large sets of files is
instantenous with idmapped mounts. This is especially useful when
ownership of a whole root filesystem of a virtual machine or
container is changed. With idmapped mounts a single syscall
mount_setattr syscall will be sufficient to change the ownership of
all files.
- Idmapped mounts always take the current ownership into account as
idmappings specify what a given uid or gid is supposed to be mapped
to. This contrasts with the chown(2) syscall which cannot by itself
take the current ownership of the files it changes into account. It
simply changes the ownership to the specified uid and gid. This is
especially problematic when recursively chown(2)ing a large set of
files which is commong with the aforementioned portable home
directory and container and vm scenario.
- Idmapped mounts allow to change ownership locally, restricting it
to specific mounts, and temporarily as the ownership changes only
apply as long as the mount exists.
Several userspace projects have either already put up patches and
pull-requests for this feature or will do so should you decide to pull
this:
- systemd: In a wide variety of scenarios but especially right away
in their implementation of portable home directories.
https://systemd.io/HOME_DIRECTORY/
- container runtimes: containerd, runC, LXD:To share data between
host and unprivileged containers, unprivileged and privileged
containers, etc. The pull request for idmapped mounts support in
containerd, the default Kubernetes runtime is already up for quite
a while now: https://github.com/containerd/containerd/pull/4734
- The virtio-fs developers and several users have expressed interest
in using this feature with virtual machines once virtio-fs is
ported.
- ChromeOS: Sharing host-directories with unprivileged containers.
I've tightly synced with all those projects and all of those listed
here have also expressed their need/desire for this feature on the
mailing list. For more info on how people use this there's a bunch of
talks about this too. Here's just two recent ones:
https://www.cncf.io/wp-content/uploads/2020/12/Rootless-Containers-in-Gitpod.pdf
https://fosdem.org/2021/schedule/event/containers_idmap/
This comes with an extensive xfstests suite covering both ext4 and
xfs:
https://git.kernel.org/brauner/xfstests-dev/h/idmapped_mounts
It covers truncation, creation, opening, xattrs, vfscaps, setid
execution, setgid inheritance and more both with idmapped and
non-idmapped mounts. It already helped to discover an unrelated xfs
setgid inheritance bug which has since been fixed in mainline. It will
be sent for inclusion with the xfstests project should you decide to
merge this.
In order to support per-mount idmappings vfsmounts are marked with
user namespaces. The idmapping of the user namespace will be used to
map the ids of vfs objects when they are accessed through that mount.
By default all vfsmounts are marked with the initial user namespace.
The initial user namespace is used to indicate that a mount is not
idmapped. All operations behave as before and this is verified in the
testsuite.
Based on prior discussions we want to attach the whole user namespace
and not just a dedicated idmapping struct. This allows us to reuse all
the helpers that already exist for dealing with idmappings instead of
introducing a whole new range of helpers. In addition, if we decide in
the future that we are confident enough to enable unprivileged users
to setup idmapped mounts the permission checking can take into account
whether the caller is privileged in the user namespace the mount is
currently marked with.
The user namespace the mount will be marked with can be specified by
passing a file descriptor refering to the user namespace as an
argument to the new mount_setattr() syscall together with the new
MOUNT_ATTR_IDMAP flag. The system call follows the openat2() pattern
of extensibility.
The following conditions must be met in order to create an idmapped
mount:
- The caller must currently have the CAP_SYS_ADMIN capability in the
user namespace the underlying filesystem has been mounted in.
- The underlying filesystem must support idmapped mounts.
- The mount must not already be idmapped. This also implies that the
idmapping of a mount cannot be altered once it has been idmapped.
- The mount must be a detached/anonymous mount, i.e. it must have
been created by calling open_tree() with the OPEN_TREE_CLONE flag
and it must not already have been visible in the filesystem.
The last two points guarantee easier semantics for userspace and the
kernel and make the implementation significantly simpler.
By default vfsmounts are marked with the initial user namespace and no
behavioral or performance changes are observed.
The manpage with a detailed description can be found here:
|
|
|
|
6e37d24599 |
btrfs: zoned: fix deadlock on log sync
Lockdep with fstests test case btrfs/041 detected a unsafe locking
scenario when we allocate the log node on a zoned filesystem.
btrfs/041
============================================
WARNING: possible recursive locking detected
5.11.0-rc7+ #939 Not tainted
--------------------------------------------
xfs_io/698 is trying to acquire lock:
ffff88810cd673a0 (&root->log_mutex){+.+.}-{3:3}, at: btrfs_sync_log+0x3d1/0xee0 [btrfs]
but task is already holding lock:
ffff88810b0fc3a0 (&root->log_mutex){+.+.}-{3:3}, at: btrfs_sync_log+0x313/0xee0 [btrfs]
other info that might help us debug this:
Possible unsafe locking scenario:
CPU0
----
lock(&root->log_mutex);
lock(&root->log_mutex);
*** DEADLOCK ***
May be due to missing lock nesting notation
2 locks held by xfs_io/698:
#0: ffff88810cd66620 (sb_internal){.+.+}-{0:0}, at: btrfs_sync_file+0x2c3/0x570 [btrfs]
#1: ffff88810b0fc3a0 (&root->log_mutex){+.+.}-{3:3}, at: btrfs_sync_log+0x313/0xee0 [btrfs]
stack backtrace:
CPU: 0 PID: 698 Comm: xfs_io Not tainted 5.11.0-rc7+ #939
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.13.0-0-gf21b5a4-rebuilt.opensuse.org 04/01/2014
Call Trace:
dump_stack+0x77/0x97
__lock_acquire.cold+0xb9/0x32a
lock_acquire+0xb5/0x400
? btrfs_sync_log+0x3d1/0xee0 [btrfs]
__mutex_lock+0x7b/0x8d0
? btrfs_sync_log+0x3d1/0xee0 [btrfs]
? btrfs_sync_log+0x3d1/0xee0 [btrfs]
? find_first_extent_bit+0x9f/0x100 [btrfs]
? __mutex_unlock_slowpath+0x35/0x270
btrfs_sync_log+0x3d1/0xee0 [btrfs]
btrfs_sync_file+0x3a8/0x570 [btrfs]
__x64_sys_fsync+0x34/0x60
do_syscall_64+0x33/0x40
entry_SYSCALL_64_after_hwframe+0x44/0xa9
This happens, because we are taking the ->log_mutex albeit it has already
been locked.
Also while at it, fix the bogus unlock of the tree_log_mutex in the error
handling.
Fixes:
|
|
|
|
95c85fba1f |
btrfs: avoid double put of block group when emptying cluster
It's wrong calling btrfs_put_block_group in
__btrfs_return_cluster_to_free_space if the block group passed is
different than the block group the cluster represents. As this means the
cluster doesn't have a reference to the passed block group. This results
in double put and a use-after-free bug.
Fix this by simply bailing if the block group we passed in does not
match the block group on the cluster.
Fixes:
|
|
|
|
3660d0bcdb |
btrfs: fix stale data exposure after cloning a hole with NO_HOLES enabled
When using the NO_HOLES feature, if we clone a file range that spans only
a hole into a range that is at or beyond the current i_size of the
destination file, we end up not setting the full sync runtime flag on the
inode. As a result, if we then fsync the destination file and have a power
failure, after log replay we can end up exposing stale data instead of
having a hole for that range.
The conditions for this to happen are the following:
1) We have a file with a size of, for example, 1280K;
2) There is a written (non-prealloc) extent for the file range from 1024K
to 1280K with a length of 256K;
3) This particular file extent layout is durably persisted, so that the
existing superblock persisted on disk points to a subvolume root where
the file has that exact file extent layout and state;
4) The file is truncated to a smaller size, to an offset lower than the
start offset of its last extent, for example to 800K. The truncate sets
the full sync runtime flag on the inode;
6) Fsync the file to log it and clear the full sync runtime flag;
7) Clone a region that covers only a hole (implicit hole due to NO_HOLES)
into the file with a destination offset that starts at or beyond the
256K file extent item we had - for example to offset 1024K;
8) Since the clone operation does not find extents in the source range,
we end up in the if branch at the bottom of btrfs_clone() where we
punch a hole for the file range starting at offset 1024K by calling
btrfs_replace_file_extents(). There we end up not setting the full
sync flag on the inode, because we don't know we are being called in
a clone context (and not fallocate's punch hole operation), and
neither do we create an extent map to represent a hole because the
requested range is beyond eof;
9) A further fsync to the file will be a fast fsync, since the clone
operation did not set the full sync flag, and therefore it relies on
modified extent maps to correctly log the file layout. But since
it does not find any extent map marking the range from 1024K (the
previous eof) to the new eof, it does not log a file extent item
for that range representing the hole;
10) After a power failure no hole for the range starting at 1024K is
punched and we end up exposing stale data from the old 256K extent.
Turning this into exact steps:
$ mkfs.btrfs -f -O no-holes /dev/sdi
$ mount /dev/sdi /mnt
# Create our test file with 3 extents of 256K and a 256K hole at offset
# 256K. The file has a size of 1280K.
$ xfs_io -f -s \
-c "pwrite -S 0xab -b 256K 0 256K" \
-c "pwrite -S 0xcd -b 256K 512K 256K" \
-c "pwrite -S 0xef -b 256K 768K 256K" \
-c "pwrite -S 0x73 -b 256K 1024K 256K" \
/mnt/sdi/foobar
# Make sure it's durably persisted. We want the last committed super
# block to point to this particular file extent layout.
sync
# Now truncate our file to a smaller size, falling within a position of
# the second extent. This sets the full sync runtime flag on the inode.
# Then fsync the file to log it and clear the full sync flag from the
# inode. The third extent is no longer part of the file and therefore
# it is not logged.
$ xfs_io -c "truncate 800K" -c "fsync" /mnt/foobar
# Now do a clone operation that only clones the hole and sets back the
# file size to match the size it had before the truncate operation
# (1280K).
$ xfs_io \
-c "reflink /mnt/foobar 256K 1024K 256K" \
-c "fsync" \
/mnt/foobar
# File data before power failure:
$ od -A d -t x1 /mnt/foobar
0000000 ab ab ab ab ab ab ab ab ab ab ab ab ab ab ab ab
*
0262144 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
*
0524288 cd cd cd cd cd cd cd cd cd cd cd cd cd cd cd cd
*
0786432 ef ef ef ef ef ef ef ef ef ef ef ef ef ef ef ef
*
0819200 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
*
1310720
<power fail>
# Mount the fs again to replay the log tree.
$ mount /dev/sdi /mnt
# File data after power failure:
$ od -A d -t x1 /mnt/foobar
0000000 ab ab ab ab ab ab ab ab ab ab ab ab ab ab ab ab
*
0262144 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
*
0524288 cd cd cd cd cd cd cd cd cd cd cd cd cd cd cd cd
*
0786432 ef ef ef ef ef ef ef ef ef ef ef ef ef ef ef ef
*
0819200 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
*
1048576 73 73 73 73 73 73 73 73 73 73 73 73 73 73 73 73
*
1310720
The range from 1024K to 1280K should correspond to a hole but instead it
points to stale data, to the 256K extent that should not exist after the
truncate operation.
The issue does not exists when not using NO_HOLES, because for that case
we use file extent items to represent holes, these are found and copied
during the loop that iterates over extents at btrfs_clone(), and that
causes btrfs_replace_file_extents() to be called with a non-NULL
extent_info argument and therefore set the full sync runtime flag on the
inode.
So fix this by making the code that deals with a trailing hole during
cloning, at btrfs_clone(), to set the full sync flag on the inode, if the
range starts at or beyond the current i_size.
A test case for fstests will follow soon.
Backporting notes: for kernel 5.4 the change goes to ioctl.c into
btrfs_clone before the last call to btrfs_punch_hole_range.
CC: stable@vger.kernel.org # 5.4+
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
|
|
1119a72e22 |
btrfs: tree-checker: do not error out if extent ref hash doesn't match
The tree checker checks the extent ref hash at read and write time to
make sure we do not corrupt the file system. Generally extent
references go inline, but if we have enough of them we need to make an
item, which looks like
key.objectid = <bytenr>
key.type = <BTRFS_EXTENT_DATA_REF_KEY|BTRFS_TREE_BLOCK_REF_KEY>
key.offset = hash(tree, owner, offset)
However if key.offset collide with an unrelated extent reference we'll
simply key.offset++ until we get something that doesn't collide.
Obviously this doesn't match at tree checker time, and thus we error
while writing out the transaction. This is relatively easy to
reproduce, simply do something like the following
xfs_io -f -c "pwrite 0 1M" file
offset=2
for i in {0..10000}
do
xfs_io -c "reflink file 0 ${offset}M 1M" file
offset=$(( offset + 2 ))
done
xfs_io -c "reflink file 0 17999258914816 1M" file
xfs_io -c "reflink file 0 35998517829632 1M" file
xfs_io -c "reflink file 0 53752752058368 1M" file
btrfs filesystem sync
And the sync will error out because we'll abort the transaction. The
magic values above are used because they generate hash collisions with
the first file in the main subvol.
The fix for this is to remove the hash value check from tree checker, as
we have no idea which offset ours should belong to.
Reported-by: Tuomas Lähdekorpi <tuomas.lahdekorpi@gmail.com>
Fixes:
|
|
|
|
dd0734f2a8 |
btrfs: fix race between swap file activation and snapshot creation
When creating a snapshot we check if the current number of swap files, in
the root, is non-zero, and if it is, we error out and warn that we can not
create the snapshot because there are active swap files.
However this is racy because when a task started activation of a swap
file, another task might have started already snapshot creation and might
have seen the counter for the number of swap files as zero. This means
that after the swap file is activated we may end up with a snapshot of the
same root successfully created, and therefore when the first write to the
swap file happens it has to fall back into COW mode, which should never
happen for active swap files.
Basically what can happen is:
1) Task A starts snapshot creation and enters ioctl.c:create_snapshot().
There it sees that root->nr_swapfiles has a value of 0 so it continues;
2) Task B enters btrfs_swap_activate(). It is not aware that another task
started snapshot creation but it did not finish yet. It increments
root->nr_swapfiles from 0 to 1;
3) Task B checks that the file meets all requirements to be an active
swap file - it has NOCOW set, there are no snapshots for the inode's
root at the moment, no file holes, no reflinked extents, etc;
4) Task B returns success and now the file is an active swap file;
5) Task A commits the transaction to create the snapshot and finishes.
The swap file's extents are now shared between the original root and
the snapshot;
6) A write into an extent of the swap file is attempted - there is a
snapshot of the file's root, so we fall back to COW mode and therefore
the physical location of the extent changes on disk.
So fix this by taking the snapshot lock during swap file activation before
locking the extent range, as that is the order in which we lock these
during buffered writes.
Fixes:
|
|
|
|
195a49eaf6 |
btrfs: fix race between writes to swap files and scrub
When we active a swap file, at btrfs_swap_activate(), we acquire the
exclusive operation lock to prevent the physical location of the swap
file extents to be changed by operations such as balance and device
replace/resize/remove. We also call there can_nocow_extent() which,
among other things, checks if the block group of a swap file extent is
currently RO, and if it is we can not use the extent, since a write
into it would result in COWing the extent.
However we have no protection against a scrub operation running after we
activate the swap file, which can result in the swap file extents to be
COWed while the scrub is running and operating on the respective block
group, because scrub turns a block group into RO before it processes it
and then back again to RW mode after processing it. That means an attempt
to write into a swap file extent while scrub is processing the respective
block group, will result in COWing the extent, changing its physical
location on disk.
Fix this by making sure that block groups that have extents that are used
by active swap files can not be turned into RO mode, therefore making it
not possible for a scrub to turn them into RO mode. When a scrub finds a
block group that can not be turned to RO due to the existence of extents
used by swap files, it proceeds to the next block group and logs a warning
message that mentions the block group was skipped due to active swap
files - this is the same approach we currently use for balance.
Fixes:
|
|
|
|
20903032cd |
btrfs: avoid checking for RO block group twice during nocow writeback
During the nocow writeback path, we currently iterate the rbtree of block groups twice: once for checking if the target block group is RO with the call to btrfs_extent_readonly()), and once again for getting a nocow reference on the block group with a call to btrfs_inc_nocow_writers(). Since btrfs_inc_nocow_writers() already returns false when the target block group is RO, remove the call to btrfs_extent_readonly(). Not only we avoid searching the blocks group rbtree twice, it also helps reduce contention on the lock that protects it (specially since it is a spin lock and not a read-write lock). That may make a noticeable difference on very large filesystems, with thousands of allocated block groups. Reviewed-by: Anand Jain <anand.jain@oracle.com> Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
|
|
|
3c17916510 |
btrfs: fix race between extent freeing/allocation when using bitmaps
During allocation the allocator will try to allocate an extent using
cluster policy. Once the current cluster is exhausted it will remove the
entry under btrfs_free_cluster::lock and subsequently acquire
btrfs_free_space_ctl::tree_lock to dispose of the already-deleted entry
and adjust btrfs_free_space_ctl::total_bitmap. This poses a problem
because there exists a race condition between removing the entry under
one lock and doing the necessary accounting holding a different lock
since extent freeing only uses the 2nd lock. This can result in the
following situation:
T1: T2:
btrfs_alloc_from_cluster insert_into_bitmap <holds tree_lock>
if (entry->bytes == 0) if (block_group && !list_empty(&block_group->cluster_list)) {
rb_erase(entry)
spin_unlock(&cluster->lock);
(total_bitmaps is still 4) spin_lock(&cluster->lock);
<doesn't find entry in cluster->root>
spin_lock(&ctl->tree_lock); <goes to new_bitmap label, adds
<blocked since T2 holds tree_lock> <a new entry and calls add_new_bitmap>
recalculate_thresholds <crashes,
due to total_bitmaps
becoming 5 and triggering
an ASSERT>
To fix this ensure that once depleted, the cluster entry is deleted when
both cluster lock and tree locks are held in the allocator (T1), this
ensures that even if there is a race with a concurrent
insert_into_bitmap call it will correctly find the entry in the cluster
and add the new space to it.
CC: <stable@vger.kernel.org> # 4.4+
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
|
|
04d4ba4c90 |
btrfs: make check_compressed_csum() to be subpage compatible
Currently check_compressed_csum() completely relies on sectorsize == PAGE_SIZE to do checksum verification for compressed extents. To make it subpage compatible, this patch will: - Do extra calculation for the csum range Since we have multiple sectors inside a page, we need to only hash the range we want, not the full page anymore. - Do sector-by-sector hash inside the page With this patch and previous conversion on btrfs_submit_compressed_read(), now we can read subpage compressed extents properly, and do proper csum verification. Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
|
|
|
be6a13613f |
btrfs: make btrfs_submit_compressed_read() subpage compatible
For compressed read, we always submit page read using page size. This doesn't work well with subpage, as for subpage one page can contain several sectors. Such submission will read range out of what we want, and cause problems. Thankfully to make it subpage compatible, we only need to change how the last page of the compressed extent is read. Instead of always adding a full page to the compressed read bio, if we're at the last page, calculate the size using compressed length, so that we only add part of the range into the compressed read bio. Since we are here, also change the PAGE_SIZE used in lookup_extent_mapping() to sectorsize. This modification won't cause any functional change, as lookup_extent_mapping() can handle the case where the search range is larger than found extent range. Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
|
|
|
d70cef0d46 |
btrfs: fix raid6 qstripe kmap
When a qstripe is required an extra page is allocated and mapped. There
were 3 problems:
1) There is no corresponding call of kunmap() for the qstripe page.
2) There is no reason to map the qstripe page more than once if the
number of bits set in rbio->dbitmap is greater than one.
3) There is no reason to map the parity page and unmap it each time
through the loop.
The page memory can continue to be reused with a single mapping on each
iteration by raid6_call.gen_syndrome() without remapping. So map the
page for the duration of the loop.
Similarly, improve the algorithm by mapping the parity page just 1 time.
Fixes:
|
|
|
|
582cd91f69 |
for-5.12/block-2021-02-17
-----BEGIN PGP SIGNATURE----- iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmAtmIwQHGF4Ym9lQGtl cm5lbC5kawAKCRD301j7KXHgplzLEAC5O+3rBM8QuiJdo39Yppmuw4hDJ6hOKynP EJQLKQQi0VfXgU+MprGvcbpFYmNbgICvUICQkEzJuk++kPCu/BJtJz0yErQeLgS+ RdXiPV6enbF7iRML5TVRTr1q/z7sJMXcIIJ8Pz/rU/JNfGYExVd0WfnEY9mp1jOt Bl9V+qyTazdP+Ma4+uEPatSayqcdi1rxB5I+7v/sLiOvKZZWkaRZjUZ/mxAjUfvK dBOOPjMygEo3tCLkIyyA6lpLvr1r+SUZhLuebRLEKa3To3TW6RtoG0qwpKmI2iKw ylLeVLB60nM9RUxjflVOfBsHxz1bDg5Ve86y5nCjQd4Jo8x1c4DnecyGE5/Tu8Rg rgbsfD6nFWzhDCvcZT0XrfQ4ZAjIL2IfT+ypQiQ6UlRd3hvIKRmzWMkjuH2svr0u ey9Kq+lYerI4cM0F3W73gzUKdIQOuCzBCYxQuSQQomscBa7FCInyU192dAI9Aj6l Yd06mgKu6qCx6zLv6JfpBqaBHZMwyGE4dmZgPQFuuwO+b4N+Ck3Jm5fzEzw/xIxQ wdo/DlsAl60BXentB6FByGBJaCjVdSymRqN/xNCAbFKCjmr6TLBuXPfg1gYYO7xC VOcVjWe8iN3wWHZab3t2mxMKH9B9B/KKzIhu6TNHSmgtQ5paZPRCBx995pDyRw26 WC22RGC2MA== =os1E -----END PGP SIGNATURE----- Merge tag 'for-5.12/block-2021-02-17' of git://git.kernel.dk/linux-block Pull core block updates from Jens Axboe: "Another nice round of removing more code than what is added, mostly due to Christoph's relentless pursuit of tech debt removal/cleanups. This pull request contains: - Two series of BFQ improvements (Paolo, Jan, Jia) - Block iov_iter improvements (Pavel) - bsg error path fix (Pan) - blk-mq scheduler improvements (Jan) - -EBUSY discard fix (Jan) - bvec allocation improvements (Ming, Christoph) - bio allocation and init improvements (Christoph) - Store bdev pointer in bio instead of gendisk + partno (Christoph) - Block trace point cleanups (Christoph) - hard read-only vs read-only split (Christoph) - Block based swap cleanups (Christoph) - Zoned write granularity support (Damien) - Various fixes/tweaks (Chunguang, Guoqing, Lei, Lukas, Huhai)" * tag 'for-5.12/block-2021-02-17' of git://git.kernel.dk/linux-block: (104 commits) mm: simplify swapdev_block sd_zbc: clear zone resources for non-zoned case block: introduce blk_queue_clear_zone_settings() zonefs: use zone write granularity as block size block: introduce zone_write_granularity limit block: use blk_queue_set_zoned in add_partition() nullb: use blk_queue_set_zoned() to setup zoned devices nvme: cleanup zone information initialization block: document zone_append_max_bytes attribute block: use bi_max_vecs to find the bvec pool md/raid10: remove dead code in reshape_request block: mark the bio as cloned in bio_iov_bvec_set block: set BIO_NO_PAGE_REF in bio_iov_bvec_set block: remove a layer of indentation in bio_iov_iter_get_pages block: turn the nr_iovecs argument to bio_alloc* into an unsigned short block: remove the 1 and 4 vec bvec_slabs entries block: streamline bvec_alloc block: factor out a bvec_alloc_gfp helper block: move struct biovec_slab to bio.c block: reuse BIO_INLINE_VECS for integrity bvecs ... |
|
|
|
4f016a316f |
New code for 5.12:
- Adjust the final parameter of iomap_dio_rw.
- Add a new flag to request that iomap directio writes return EAGAIN if
the write is not a pure overwrite within EOF; this will be used to
reduce lock contention with unaligned direct writes on XFS.
- Amend XFS' directio code to eliminate exclusive locking for unaligned
direct writes if the circumstances permit
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEEUzaAxoMeQq6m2jMV+H93GTRKtOsFAmAZgQAACgkQ+H93GTRK
tOtNqw/+KPff1NjQVK2k361R0+LjlEHfe2nxh7+kS10IiR5nbBz4Fu+GwEosZKq+
H9ficBbZ0wIveV+5CEt2xZLEJFC4LZUpNPVVrUf8XPLKiVexP/U3wtKzmv9Z7D5J
5walMWQycVeR+ycomynV36giqekvARL7KCQG5By2ITfSNxfnb/wvKhn1d61ZDOF6
f4xzq7F6+cEOrSZt2LcFzGSfsTl6oakYMAomPU57sqGmw7MHRqoPTErbdh2HnVJy
yQ47eiZgSKWKA+Qm+VvHHePYCYnu0nvA2rbNerjTN70hnO8rK9S0Vle6Sp5CUqAX
sXOy8zxOLYKqyM4S/QkIN2TGIyWg+CHiakVLZGF3Q4AUDDYfpD0cHvAe9N3v9euL
qt8ypT8dz2C3qiTg5E31xy033wlAP0wg3FZiLAqEjL5o3fzD+qbplTiSmYbMV2Fb
xuu7a2T6u1MHaIn1IhaL0cB49Fzn+5EMyp6BlAucAOakyuqJCyJiXokdk0Looy5e
jUshvcwWcmHMpI/YYYY6t56KV6tl2exGq5sySY5U6dr8/r5lwc0SI+TrYFG0jTR8
59DGd5CkKgdBFcuys+eaZDXgr7A4ymkVE+pE0QNDz9UwNP20tLb3dQNlhgxchUgu
NgPaFgQkoNM3HmQNyU2wX/t1aFlC/doqSkb/96UWQSxq6IrajMU=
=AR07
-----END PGP SIGNATURE-----
Merge tag 'iomap-5.12-merge-2' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux
Pull iomap updates from Darrick Wong:
"The big change in this cycle is some new code to make it possible for
XFS to try unaligned directio overwrites without taking locks. If the
block is fully written and within EOF (i.e. doesn't require any
further fs intervention) then we can let the unlocked write proceed.
If not, we fall back to synchronizing direct writes.
Summary:
- Adjust the final parameter of iomap_dio_rw.
- Add a new flag to request that iomap directio writes return EAGAIN
if the write is not a pure overwrite within EOF; this will be used
to reduce lock contention with unaligned direct writes on XFS.
- Amend XFS' directio code to eliminate exclusive locking for
unaligned direct writes if the circumstances permit"
* tag 'iomap-5.12-merge-2' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux:
xfs: reduce exclusive locking on unaligned dio
xfs: split the unaligned DIO write code out
xfs: improve the reflink_bounce_dio_write tracepoint
xfs: simplify the read/write tracepoints
xfs: remove the buffered I/O fallback assert
xfs: cleanup the read/write helper naming
xfs: make xfs_file_aio_write_checks IOCB_NOWAIT-aware
xfs: factor out a xfs_ilock_iocb helper
iomap: add a IOMAP_DIO_OVERWRITE_ONLY flag
iomap: pass a flags argument to iomap_dio_rw
iomap: rename the flags variable in __iomap_dio_rw
|
|
|
|
6f3952cbe0 |
for-5.12-tag
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEE8rQSAMVO+zA4DBdWxWXV+ddtWDsFAmAqyGEACgkQxWXV+ddt
WDuU6BAAhfI5BndMm6a1LooMsBHTR7Mh/aFXZEKX7vCDRnrkr+WiihDFhXu4tH3y
arRsdwMnJCnta2/JMI5xCZZRg9Bsb/Sa0qWoR9sDBVoGRMnE1DS5YHQyv0bfJYk0
qYOW/jorBV1n/hL19+WbDFajwajP86uGtlDKV7cJ/C3lIogQma7zQ7ygwxbDcZqm
ZQVHg7ooM4P1t7EV0eDlatxn0Sm8KFkxXD7dbu37qDLWr3Aw8N4IwT7I9h4b+/tg
hL4dqMPxX6AyRiI0VBsqKnmcRWtT9cN7yw0+J+/JK5KuaFFx3qyZZ+EQu1jAGZDt
2m432YKya8LQfyBuSe8uoCIcczhGoD0EPIhspecDMfWTvxdo+AeTJZzZzj3u1y+v
3pih+gBN1sa8vRVSX08mIBF/k0pPfxRu7gIjvl4wl18bm3Khq5VJ93ImP7DNroNg
bKiUG35K+kvXGBNaLY71zZfO6aLMddK73aDudSbYOS8XcbKhor1G8j5o5/EkcVQA
wio4Gw5BmfVeRuXOl2h1aEXThk+469s0DR7MiMiAA6917cUjQiFUgFOaogR0XY3S
8ffX+S50AFW834J0eIGHPLmzi70WwSSXCS2q+zl87PPRK5+jCp9ZzWGi9MGG1qdh
fp7XVMkzHVSKGK5GXB+ICUfzkShxfTCh+EbxcXIulONxsEdADsc=
=0O6r
-----END PGP SIGNATURE-----
Merge tag 'for-5.12-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux
Pull btrfs updates from David Sterba:
"This brings updates of space handling, performance improvements or bug
fixes. The subpage block size and zoned mode features have reached
state where they're usable but with limitations.
Performance or related:
- do not block on deleted block group mutex in the cleaner, avoids
some long stalls
- improved flushing: make it work better with ticket space
reservations and avoid excessive transaction commits in some
scenarios, slightly improves throughput for random write load
- preemptive background flushing: separate the logic from ticket
reservations, improve the accounting and decisions when to flush in
low space conditions
- less lock contention related to running delayed refs, let just one
thread do the flushing when there are many inside transaction
commit
- dbench workload improvements: avoid unnecessary work when logging
inodes, fewer fallbacks to transaction commit and thus less waiting
for it (+7% throughput, -20% latency)
Core:
- subpage block size
- currently read-only support
- refactor and generalize code where sectorsize is assumed to be
page size, add the subpage handling everywhere
- the read-write support is on the way, page sizes are still
limited to 4K or 64K
- zoned mode, first working version but with limitations
- SMR/ZBC/ZNS friendly allocation mode, utilizing the "no fixed
location for structures" and chunked allocation
- superblock as the only fixed data structure needs special
handling, uses 2 consecutive zones as a ring buffer
- tree-log support with a dedicated block group to avoid unordered
writes
- emulated zones on non-zoned devices
- not yet working
- all non-single block group profiles, requires more zone write
pointer synchronization between the multiple block groups
- fitrim due to dependency on space cache, can be implemented
Fixes:
- ref-verify: proper tree owner and node level tracking
- fix pinned byte accounting, causing some early ENOSPC now more
likely due to other changes in delayed refs
Other:
- error handling fixes and improvements
- more error injection points
- more function documentation
- more and updated tracepoints
- subset of W=1 checked by default
- update comments to allow more automatic kdoc parameter checks"
* tag 'for-5.12-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux: (144 commits)
btrfs: zoned: enable to mount ZONED incompat flag
btrfs: zoned: deal with holes writing out tree-log pages
btrfs: zoned: reorder log node allocation on zoned filesystem
btrfs: zoned: serialize log transaction on zoned filesystems
btrfs: zoned: extend zoned allocator to use dedicated tree-log block group
btrfs: split alloc_log_tree()
btrfs: zoned: relocate block group to repair IO failure in zoned filesystems
btrfs: zoned: enable relocation on a zoned filesystem
btrfs: zoned: support dev-replace in zoned filesystems
btrfs: zoned: implement copying for zoned device-replace
btrfs: zoned: implement cloning for zoned device-replace
btrfs: zoned: mark block groups to copy for device-replace
btrfs: zoned: do not use async metadata checksum on zoned filesystems
btrfs: zoned: wait for existing extents before truncating
btrfs: zoned: serialize metadata IO
btrfs: zoned: introduce dedicated data write path for zoned filesystems
btrfs: zoned: enable zone append writing for direct IO
btrfs: zoned: use ZONE_APPEND write for zoned mode
btrfs: save irq flags when looking up an ordered extent
btrfs: zoned: cache if block group is on a sequential zone
...
|
|
|
|
e42ee56fe5 |
for-5.11-rc7-tag
-----BEGIN PGP SIGNATURE----- iQIzBAABCgAdFiEE8rQSAMVO+zA4DBdWxWXV+ddtWDsFAmAmlkAACgkQxWXV+ddt WDuwNxAAiBAhEwPllzyU86p4RMMip5pa24zu11HkTya65yGk6EFuj4zTlx/L5Fn6 JOjxwlPqaTItER1PYJ5HRdIy1Y2E4eWEiDLolvmvDCPZrfKRKhBU1MZbgXwDbp+Z pwaJGIm5ZaXDGyuFge3bKA48BERfqxRBO3qIOZ0tzgsUFLlZ2d9EdDc99093/J6k QzIijXQjFnvnB2MNawN1b/KQ63xqXLo2hemKcKIFCxJHm9eaet/qwGHl5iuR5ScY bOGCWvLSkCXceartDur3msOZXur09YLyfeYmE9dj1FN3aNu97sW8VivWRrs3aglK if51iYrrjKSnDr4SOK28S5UYdgeStb/qWWtosdcMsQVBo0t7iCnGT2psGaQCkdfG FChqbs2uXlbJrojlelV6xbaU3S2D2MtSz5mF+I2G5MpQbj1jkhYE9ZTUQeibcd7o l+edn/VJvVK4X0NAX8pIWJ4nFY1HqUTyfn28IQ7ymBhyyUloIoazvSkBuSWy6iy0 9aPpohOKjCw8Y3MbgcIfIEJhdK+aIKF8ZPh52+zcXQzf1OtSryVarLHsNXWm9vJ8 tHsRHCzrbLFdAXZccT6YlerzPs4+PVf44UknDbFCg7sLcG04NIGGrMXOtTHwgEZL BEywTjAMlMDjrEXouxYAPNPnEg/NlvQGZYRvBnxrtZE4G2fxJ7o= =7w6G -----END PGP SIGNATURE----- Merge tag 'for-5.11-rc7-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux Pull btrfs fix from David Sterba: "A regression fix caused by a refactoring in 5.11. A corrupted superblock wouldn't be detected by checksum verification due to wrongly placed initialization of the checksum length, thus making memcmp always work" * tag 'for-5.11-rc7-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux: btrfs: initialize fs_info::csum_size earlier in open_ctree |
|
|
|
83c68bbcb6 |
btrfs: initialize fs_info::csum_size earlier in open_ctree
User reported that btrfs-progs misc-tests/028-superblock-recover fails:
[TEST/misc] 028-superblock-recover
unexpected success: mounted fs with corrupted superblock
test failed for case 028-superblock-recover
The test case expects that a broken image with bad superblock will be
rejected to be mounted. However, the test image just passed csum check
of superblock and was successfully mounted.
Commit
|
|
|
|
9d294a685f |
btrfs: zoned: enable to mount ZONED incompat flag
This final patch adds the ZONED incompat flag to the supported flags and enables to mount ZONED flagged file system. Reviewed-by: Anand Jain <anand.jain@oracle.com> Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
|
|
|
b528f46713 |
btrfs: zoned: deal with holes writing out tree-log pages
Since the zoned filesystem requires sequential write out of metadata, we cannot proceed with a hole in tree-log pages. When such a hole exists, btree_write_cache_pages() will return -EAGAIN. This happens when someone, e.g., a concurrent transaction commit, writes a dirty extent in this tree-log commit. If we are not going to wait for the extents, we can hope the concurrent writing fills the hole for us. So, we can ignore the error in this case and hope the next write will succeed. If we want to wait for them and got the error, we cannot wait for them because it will cause a deadlock. So, let's bail out to a full commit in this case. Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com> Signed-off-by: David Sterba <dsterba@suse.com> |
|
|
|
3ddebf27fc |
btrfs: zoned: reorder log node allocation on zoned filesystem
This is the 3/3 patch to enable tree-log on zoned filesystems. The allocation order of nodes of "fs_info->log_root_tree" and nodes of "root->log_root" is not the same as the writing order of them. So, the writing causes unaligned write errors. Reorder the allocation of them by delaying allocation of the root node of "fs_info->log_root_tree," so that the node buffers can go out sequentially to devices. Cc: Filipe Manana <fdmanana@gmail.com> Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com> Signed-off-by: David Sterba <dsterba@suse.com> |
|
|
|
fa1a0f42a0 |
btrfs: zoned: serialize log transaction on zoned filesystems
This is the 2/3 patch to enable tree-log on zoned filesystems. Since we can start more than one log transactions per subvolume simultaneously, nodes from multiple transactions can be allocated interleaved. Such mixed allocation results in non-sequential writes at the time of a log transaction commit. The nodes of the global log root tree (fs_info->log_root_tree), also have the same problem with mixed allocation. Serializes log transactions by waiting for a committing transaction when someone tries to start a new transaction, to avoid the mixed allocation problem. We must also wait for running log transactions from another subvolume, but there is no easy way to detect which subvolume root is running a log transaction. So, this patch forbids starting a new log transaction when other subvolumes already allocated the global log root tree. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com> Signed-off-by: David Sterba <dsterba@suse.com> |
|
|
|
40ab3be102 |
btrfs: zoned: extend zoned allocator to use dedicated tree-log block group
This is the 1/3 patch to enable tree log on zoned filesystems. The tree-log feature does not work on a zoned filesystem as is. Blocks for a tree-log tree are allocated mixed with other metadata blocks and btrfs writes and syncs the tree-log blocks to devices at the time of fsync(), which has a different timing than a global transaction commit. As a result, both writing tree-log blocks and writing other metadata blocks become non-sequential writes that zoned filesystems must avoid. Introduce a dedicated block group for tree-log blocks, so that tree-log blocks and other metadata blocks can be separate write streams. As a result, each write stream can now be written to devices separately. "fs_info->treelog_bg" tracks the dedicated block group and assigns "treelog_bg" on-demand on tree-log block allocation time. This commit extends the zoned block allocator to use the block group. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com> Signed-off-by: David Sterba <dsterba@suse.com> |
|
|
|
6ab6ebb760 |
btrfs: split alloc_log_tree()
This is a preparation patch for the next patch. Split alloc_log_tree() into two parts. The first one allocating the tree structure, remains in alloc_log_tree() and the second part allocating the tree node, which is moved into btrfs_alloc_log_tree_node(). Also export the latter part is to be used in the next patch. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
|
|
|
f7ef5287a6 |
btrfs: zoned: relocate block group to repair IO failure in zoned filesystems
When a bad checksum is found and if the filesystem has a mirror of the
damaged data, we read the correct data from the mirror and writes it to
damaged blocks. This however, violates the sequential write constraints
of a zoned block device.
We can consider three methods to repair an IO failure in zoned filesystems:
(1) Reset and rewrite the damaged zone
(2) Allocate new device extent and replace the damaged device extent to
the new extent
(3) Relocate the corresponding block group
Method (1) is most similar to a behavior done with regular devices.
However, it also wipes non-damaged data in the same device extent, and
so it unnecessary degrades non-damaged data.
Method (2) is much like device replacing but done in the same device. It
is safe because it keeps the device extent until the replacing finish.
However, extending device replacing is non-trivial. It assumes
"src_dev->physical == dst_dev->physical". Also, the extent mapping
replacing function should be extended to support replacing device extent
position in one device.
Method (3) invokes relocation of the damaged block group and is
straightforward to implement. It relocates all the mirrored device
extents, so it potentially is a more costly operation than method (1) or
(2). But it relocates only used extents which reduce the total IO size.
Let's apply method (3) for now. In the future, we can extend device-replace
and apply method (2).
For protecting a block group gets relocated multiple time with multiple
IO errors, this commit introduces "relocating_repair" bit to show it's
now relocating to repair IO failures. Also it uses a new kthread
"btrfs-relocating-repair", not to block IO path with relocating process.
This commit also supports repairing in the scrub process.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
|
|
32430c6148 |
btrfs: zoned: enable relocation on a zoned filesystem
Currently fallocate() is disabled on a zoned filesystem. Since current relocation process relies on preallocation to move file data extents, it must be handled differently. On a zoned filesystem, we just truncate the inode to the size that we wanted to pre-allocate. Then, we flush dirty pages on the file before finishing the relocation process. run_delalloc_zoned() will handle all the allocations and submit IOs to the underlying layers. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com> Signed-off-by: David Sterba <dsterba@suse.com> |
|
|
|
7db1c5d14d |
btrfs: zoned: support dev-replace in zoned filesystems
This is 4/4 patch to implement device-replace on zoned filesystems. Even after the copying is done, the write pointers of the source device and the destination device may not be synchronized. For example, when the last allocated extent is freed before device-replace process, the extent is not copied, leaving a hole there. Synchronize the write pointers by writing zeroes to the destination device. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com> Signed-off-by: David Sterba <dsterba@suse.com> |
|
|
|
de17addce7 |
btrfs: zoned: implement copying for zoned device-replace
This is 3/4 patch to implement device-replace on zoned filesystems.
This commit implements copying. To do this, it tracks the write pointer
during the device replace process. As device-replace's copy process is
smart enough to only copy used extents on the source device, we have to
fill the gap to honor the sequential write requirement in the target
device.
The device-replace process on zoned filesystems must copy or clone all
the extents in the source device exactly once. So, we need to ensure
allocations started just before the dev-replace process to have their
corresponding extent information in the B-trees.
finish_extent_writes_for_zoned() implements that functionality, which
basically is the removed code in the commit
|
|
|
|
6143c23ccc |
btrfs: zoned: implement cloning for zoned device-replace
This is 2/4 patch to implement device replace for zoned filesystems. In zoned mode, a block group must be either copied (from the source device to the target device) or cloned (to both devices). Implement the cloning part. If a block group targeted by an IO is marked to copy, we should not clone the IO to the destination device, because the block group is eventually copied by the replace process. This commit also handles cloning of device reset. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com> Signed-off-by: David Sterba <dsterba@suse.com> |
|
|
|
78ce9fc269 |
btrfs: zoned: mark block groups to copy for device-replace
This is the 1/4 patch to support device-replace on zoned filesystems.
We have two types of IOs during the device replace process. One is an IO
to "copy" (by the scrub functions) all the device extents from the source
device to the destination device. The other one is an IO to "clone" (by
handle_ops_on_dev_replace()) new incoming write IOs from users to the
source device into the target device.
Cloning incoming IOs can break the sequential write rule in on target
device. When a write is mapped in the middle of a block group, the IO is
directed to the middle of a target device zone, which breaks the
sequential write requirement.
However, the cloning function cannot be disabled since incoming IOs
targeting already copied device extents must be cloned so that the IO is
executed on the target device.
We cannot use dev_replace->cursor_{left,right} to determine whether a bio
is going to a not yet copied region. Since we have a time gap between
finishing btrfs_scrub_dev() and rewriting the mapping tree in
btrfs_dev_replace_finishing(), we can have a newly allocated device extent
which is never cloned nor copied.
So the point is to copy only already existing device extents. This patch
introduces mark_block_group_to_copy() to mark existing block groups as a
target of copying. Then, handle_ops_on_dev_replace() and dev-replace can
check the flag to do their job.
Also, btrfs_finish_block_group_to_copy() will check if the copied stripe
is the last stripe in the block group. With the last stripe copied,
the to_copy flag is finally disabled. Afterwards we can safely clone
incoming IOs on this block group.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
|
|
4eef29ef63 |
btrfs: zoned: do not use async metadata checksum on zoned filesystems
On zoned filesystems, btrfs uses per-fs zoned_meta_io_lock to serialize the metadata write IOs. Even with this serialization, write bios sent from btree_write_cache_pages can be reordered by async checksum workers as these workers are per CPU and not per zone. To preserve write bio ordering, we disable async metadata checksum on a zoned filesystem. This does not result in lower performance with HDDs as a single CPU core is fast enough to do checksum for a single zone write stream with the maximum possible bandwidth of the device. If multiple zones are being written simultaneously, HDD seek overhead lowers the achievable maximum bandwidth, resulting again in a per zone checksum serialization not affecting the performance. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
|
|
|
24c0a7227f |
btrfs: zoned: wait for existing extents before truncating
When truncating a file, file buffers which have already been allocated but not yet written may be truncated. Truncating these buffers could cause breakage of a sequential write pattern in a block group if the truncated blocks are for example followed by blocks allocated to another file. To avoid this problem, always wait for write out of all unwritten buffers before proceeding with the truncate execution. Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com> Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: David Sterba <dsterba@suse.com> |
|
|
|
0bc09ca129 |
btrfs: zoned: serialize metadata IO
We cannot use zone append for writing metadata, because the B-tree nodes have references to each other using logical address. Without knowing the address in advance, we cannot construct the tree in the first place. So we need to serialize write IOs for metadata. We cannot add a mutex around allocation and submission because metadata blocks are allocated in an earlier stage to build up B-trees. Add a zoned_meta_io_lock and hold it during metadata IO submission in btree_write_cache_pages() to serialize IOs. Furthermore, this adds a per-block group metadata IO submission pointer "meta_write_pointer" to ensure sequential writing, which can break when attempting to write back blocks in an unfinished transaction. If the writing out failed because of a hole and the write out is for data integrity (WB_SYNC_ALL), it returns EAGAIN. A caller like fsync() code should handle this properly e.g. by falling back to a full transaction commit. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com> Signed-off-by: David Sterba <dsterba@suse.com> |
|
|
|
42c0110009 |
btrfs: zoned: introduce dedicated data write path for zoned filesystems
If more than one IO is issued for one file extent, these IO can be written to separate regions on a device. Since we cannot map one file extent to such a separate area on a zoned filesystem, we need to follow the "one IO == one ordered extent" rule. The normal buffered, uncompressed and not pre-allocated write path (used by cow_file_range()) sometimes does not follow this rule. It can write a part of an ordered extent when specified a region to write e.g., when its called from fdatasync(). Introduce a dedicated (uncompressed buffered) data write path for zoned filesystems, that will COW the region and write it at once. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
|
|
|
544d24f9de |
btrfs: zoned: enable zone append writing for direct IO
Likewise to buffered IO, enable zone append writing for direct IO when its used on a zoned block device. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
|
|
|
d8e3fb106f |
btrfs: zoned: use ZONE_APPEND write for zoned mode
Enable zone append writing for zoned mode. When using zone append, a bio is issued to the start of a target zone and the device decides to place it inside the zone. Upon completion the device reports the actual written position back to the host. Three parts are necessary to enable zone append mode. First, modify the bio to use REQ_OP_ZONE_APPEND in btrfs_submit_bio_hook() and adjust the bi_sector to point the beginning of the zone. Second, record the returned physical address (and disk/partno) to the ordered extent in end_bio_extent_writepage() after the bio has been completed. We cannot resolve the physical address to the logical address because we can neither take locks nor allocate a buffer in this end_bio context. So, we need to record the physical address to resolve it later in btrfs_finish_ordered_io(). And finally, rewrite the logical addresses of the extent mapping and checksum data according to the physical address using btrfs_rmap_block. If the returned address matches the originally allocated address, we can skip this rewriting process. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
|
|
|
24533f6a9a |
btrfs: save irq flags when looking up an ordered extent
A following patch will add another caller of btrfs_lookup_ordered_extent(), but from a bio's endio context. btrfs_lookup_ordered_extent() uses spin_lock_irq() which unconditionally disables interrupts. Change this to spin_lock_irqsave() so interrupts aren't disabled and re-enabled unconditionally. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
|
|
|
08f455593f |
btrfs: zoned: cache if block group is on a sequential zone
On a zoned filesystem, cache if a block group is on a sequential write only zone. On sequential write only zones, we can use REQ_OP_ZONE_APPEND for writing data, therefore provide btrfs_use_zone_append() to figure out if IO is targeting a sequential write only zone and we can use REQ_OP_ZONE_APPEND for data writing. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
|
|
|
138082f366 |
btrfs: extend btrfs_rmap_block for specifying a device
btrfs_rmap_block currently reverse-maps the physical addresses on all devices to the corresponding logical addresses. Extend the function to match to a specified device. The old functionality of querying all devices is left intact by specifying NULL as target device. A block_device instead of a btrfs_device is passed into btrfs_rmap_block, as this function is intended to reverse-map the result of a bio, which only has a block_device. Also export the function for later use. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com> Signed-off-by: David Sterba <dsterba@suse.com> |
|
|
|
cacb2cea46 |
btrfs: zoned: check if bio spans across an ordered extent
To ensure that an ordered extent maps to a contiguous region on disk, we need to maintain a "one bio == one ordered extent" rule. Ensure that constructing bio does not span more than an ordered extent. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
|
|
|
d22002fd37 |
btrfs: zoned: split ordered extent when bio is sent
For a zone append write, the device decides the location the data is being written to. Therefore we cannot ensure that two bios are written consecutively on the device. In order to ensure that an ordered extent maps to a contiguous region on disk, we need to maintain a "one bio == one ordered extent" rule. Implement splitting of an ordered extent and extent map on bio submission to adhere to the rule. extract_ordered_extent() hooks into btrfs_submit_data_bio() and splits the corresponding ordered extent so that the ordered extent's region fits into one bio and the corresponding device limits. Several sanity checks need to be done in extract_ordered_extent() e.g. - We cannot split once end_bio'd ordered extent because we cannot divide ordered->bytes_left for the split ones - We do not expect a compressed ordered extent - We should not have checksum list because we omit the list splitting. Since the function is called before btrfs_wq_submit_bio() or btrfs_csum_one_bio(), this should be always ensured. We also need to split an extent map by creating a new one. If not, unpin_extent_cache() complains about the difference between the start of the extent map and the file's logical offset. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com> Signed-off-by: David Sterba <dsterba@suse.com> |
|
|
|
cfe94440d1 |
btrfs: zoned: handle REQ_OP_ZONE_APPEND as writing
Zoned filesystems use REQ_OP_ZONE_APPEND bios for writing to actual devices. Let btrfs_end_bio() and btrfs_op be aware of it, by mapping REQ_OP_ZONE_APPEND to BTRFS_MAP_WRITE and using btrfs_op() instead of bio_op(). Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
|
|
|
e1326f0339 |
btrfs: zoned: use bio_add_zone_append_page
A zoned device has its own hardware restrictions e.g. max_zone_append_size when using REQ_OP_ZONE_APPEND. To follow these restrictions, use bio_add_zone_append_page() instead of bio_add_page(). We need target device to use bio_add_zone_append_page(), so this commit reads the chunk information to cache the target device to btrfs_io_bio(bio)->device. Caching only the target device is sufficient here as zoned filesystems only supports the single profile at the moment. Once more profiles will be supported btrfs_io_bio can hold an extent_map to be able to check for the restrictions of all devices the btrfs_bio will be mapped to. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
|
|
|
953651eb30 |
btrfs: factor out helper adding a page to bio
Factor out adding a page to a bio from submit_extent_page(). The page is added only when bio_flags are the same, contiguous and the added page fits in the same stripe as pages in the bio. Condition checks are reordered to allow early return to avoid possibly heavy btrfs_bio_fits_in_stripe() calling. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
|
|
|
dcba6e48b5 |
btrfs: zoned: reset zones of unused block groups
We must reset the zones of a deleted unused block group to rewind the zones' write pointers to the zones' start. To do this, we can use the DISCARD_SYNC code to do the reset when the filesystem is running on zoned devices. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
|
|
|
011b41bffa |
btrfs: zoned: advance allocation pointer after tree log node
Since the allocation info of a tree log node is not recorded in the extent tree, calculate_alloc_pointer() cannot detect this node, so the pointer can be over a tree node. Replaying the log calls btrfs_remove_free_space() for each node in the log tree. So, advance the pointer after the node to not allocate over it. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com> Signed-off-by: David Sterba <dsterba@suse.com> |
|
|
|
d3575156f6 |
btrfs: zoned: redirty released extent buffers
Tree manipulating operations like merging nodes often release once-allocated tree nodes. Such nodes are cleaned so that pages in the node are not uselessly written out. On zoned volumes, however, such optimization blocks the following IOs as the cancellation of the write out of the freed blocks breaks the sequential write sequence expected by the device. Introduce a list of clean and unwritten extent buffers that have been released in a transaction. Redirty the buffers so that btree_write_cache_pages() can send proper bios to the devices. Besides it clears the entire content of the extent buffer not to confuse raw block scanners e.g. 'btrfs check'. By clearing the content, csum_dirty_buffer() complains about bytenr mismatch, so avoid the checking and checksum using newly introduced buffer flag EXTENT_BUFFER_NO_CHECK. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com> Signed-off-by: David Sterba <dsterba@suse.com> |
|
|
|
2eda57089e |
btrfs: zoned: implement sequential extent allocation
Implement a sequential extent allocator for zoned filesystems. This allocator only needs to check if there is enough space in the block group after the allocation pointer to satisfy the extent allocation request. Therefore the allocator never manages bitmaps or clusters. Also, add assertions to the corresponding functions. As zone append writing is used, it would be unnecessary to track the allocation offset, as the allocator only needs to check available space. But by tracking and returning the offset as an allocated region, we can skip modification of ordered extents and checksum information when there is no IO reordering. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
|
|
|
169e0da91a |
btrfs: zoned: track unusable bytes for zones
In a zoned filesystem a once written then freed region is not usable until the underlying zone has been reset. So we need to distinguish such unusable space from usable free space. Therefore we need to introduce the "zone_unusable" field to the block group structure, and "bytes_zone_unusable" to the space_info structure to track the unusable space. Pinned bytes are always reclaimed to the unusable space. But, when an allocated region is returned before using e.g., the block group becomes read-only between allocation time and reservation time, we can safely return the region to the block group. For the situation, this commit introduces "btrfs_add_free_space_unused". This behaves the same as btrfs_add_free_space() on regular filesystem. On zoned filesystems, it rewinds the allocation offset. Because the read-only bytes tracks free but unusable bytes when the block group is read-only, we need to migrate the zone_unusable bytes to read-only bytes when a block group is marked read-only. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com> Signed-off-by: David Sterba <dsterba@suse.com> |
|
|
|
a94794d50d |
btrfs: zoned: calculate allocation offset for conventional zones
Conventional zones do not have a write pointer, so we cannot use it to determine the allocation offset for sequential allocation if a block group contains a conventional zone. But instead, we can consider the end of the highest addressed extent in the block group for the allocation offset. For new block group, we cannot calculate the allocation offset by consulting the extent tree, because it can cause deadlock by taking extent buffer lock after chunk mutex, which is already taken in btrfs_make_block_group(). Since it is a new block group anyways, we can simply set the allocation offset to 0. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com> Signed-off-by: David Sterba <dsterba@suse.com> |
|
|
|
08e11a3db0 |
btrfs: zoned: load zone's allocation offset
A zoned filesystem must allocate blocks at the zones' write pointer. The device's write pointer position can be mapped to a logical address within a block group. To facilitate this, add an "alloc_offset" to the block-group to track the logical addresses of the write pointer. This logical address is populated in btrfs_load_block_group_zone_info() from the write pointers of corresponding zones. For now, zoned filesystems the single profile. Supporting non-single profile with zone append writing is not trivial. For example, in the DUP profile, we send a zone append writing IO to two zones on a device. The device reply with written LBAs for the IOs. If the offsets of the returned addresses from the beginning of the zone are different, then it results in different logical addresses. We need fine-grained logical to physical mapping to support such separated physical address issue. Since it should require additional metadata type, disable non-single profiles for now. This commit supports the case all the zones in a block group are sequential. The next patch will handle the case having a conventional zone. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
|
|
|
381a696eb5 |
btrfs: zoned: verify device extent is aligned to zone
Add a check in verify_one_dev_extent() to ensure that a device extent on a zoned block device is aligned to the respective zone boundary. If it isn't, mark the filesystem as unclean. Reviewed-by: Anand Jain <anand.jain@oracle.com> Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
|
|
|
1cd6121f2a |
btrfs: zoned: implement zoned chunk allocator
Implement a zoned chunk and device extent allocator. One device zone becomes a device extent so that a zone reset affects only this device extent and does not change the state of blocks in the neighbor device extents. To implement the allocator, we need to extend the following functions for a zoned filesystem. - init_alloc_chunk_ctl - dev_extent_search_start - dev_extent_hole_check - decide_stripe_size init_alloc_chunk_ctl_zoned() is mostly the same as regular one. It always set the stripe_size to the zone size and aligns the parameters to the zone size. dev_extent_search_start() only aligns the start offset to zone boundaries. We don't care about the first 1MB like in regular filesystem because we anyway reserve the first two zones for superblock logging. dev_extent_hole_check_zoned() checks if zones in given hole are either conventional or empty sequential zones. Also, it skips zones reserved for superblock logging. With the change to the hole, the new hole may now contain pending extents. So, in this case, loop again to check that. Finally, decide_stripe_size_zoned() should shrink the number of devices instead of stripe size because we need to honor stripe_size == zone_size. Reviewed-by: Anand Jain <anand.jain@oracle.com> Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
|
|
|
3c9daa09cc |
btrfs: zoned: allow zoned filesystems on non-zoned block devices
Run a zoned filesystem on non-zoned devices. This is done by "slicing up" the block device into static sized chunks and fake a conventional zone on each of them. The emulated zone size is determined from the size of device extent. This is mainly aimed at testing of zoned filesystems, i.e. the zoned chunk allocator, on regular block devices. Reviewed-by: Anand Jain <anand.jain@oracle.com> Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
|
|
|
1cb3dc3f79 |
btrfs: zoned: disallow fitrim on zoned filesystems
The implementation of fitrim depends on space cache, which is not used and disabled for zoned extent allocator. So the current code does not work with zoned filesystem. In the future, we can implement fitrim for zoned filesystems by enabling space cache (but, only for fitrim) or scanning the extent tree at fitrim time. For now, disallow fitrim on zoned filesystems. Reviewed-by: Anand Jain <anand.jain@oracle.com> Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
|
|
|
b53429bad3 |
btrfs: zoned: do not load fs_info::zoned from incompat flag
Don't set the zoned flag in fs_info as soon as we're encountering the incompat filesystem flag for a zoned filesystem on mount. The zoned flag in fs_info is in a union together with the zone_size, so setting it too early will result in setting an incorrect zone_size as well. Once the correct zone_size is read from the device, we can rely on the zoned flag in fs_info as well to determine if the filesystem is zoned. Reviewed-by: Anand Jain <anand.jain@oracle.com> Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
|
|
|
4afd2fe835 |
btrfs: release path before calling to btrfs_load_block_group_zone_info
Since we have no write pointer in conventional zones, we cannot determine the allocation offset from it. Instead, we set the allocation offset after the highest addressed extent. This is done by reading the extent tree in btrfs_load_block_group_zone_info(). However, this function is called from btrfs_read_block_groups(), so the read lock for the tree node could be recursively taken. To avoid this unsafe locking scenario, release the path before reading the extent tree to get the allocation offset. Reviewed-by: Anand Jain <anand.jain@oracle.com> Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
|
|
|
d6639b35da |
btrfs: zoned: use regular super block location on zone emulation
A zoned filesystem currently has a superblock at the beginning of the superblock logging zones if the zones are conventional. This difference in superblock position causes a chicken-and-egg problem for filesystems with emulated zones. Since the device is a regular (non-zoned) device, we cannot know if the filesystem is regular or zoned while reading the superblock. But, to load the superblock, we need to see if it is emulated zoned or not. Place the superblocks at the same location as they are on regular filesystem on regular devices to solve the problem. It is possible because it's ensured that all the superblock locations are at an (emulated) conventional zone on regular devices. Reviewed-by: Anand Jain <anand.jain@oracle.com> Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com> Signed-off-by: David Sterba <dsterba@suse.com> |
|
|
|
7365104236 |
btrfs: zoned: defer loading zone info after opening trees
This is a preparation patch to implement zone emulation on a regular device. To emulate a zoned filesystem on a regular (non-zoned) device, we need to decide an emulated zone size. Instead of making it a compile-time static value, we'll make it configurable at mkfs time. Since we have one zone == one device extent restriction, we can determine the emulated zone size from the size of a device extent. We can extend btrfs_get_dev_zone_info() to show a regular device filled with conventional zones once the zone size is decided. The current call site of btrfs_get_dev_zone_info() during the mount process is earlier than loading the file system trees so that we don't know the size of a device extent at this point. Thus we can't slice a regular device to conventional zones. This patch introduces btrfs_get_dev_zone_info_all_devices to load the zone info for all the devices. And, it places this function in open_ctree() after loading the trees. Reviewed-by: Anand Jain <anand.jain@oracle.com> Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
|
|
|
72c9925f87 |
btrfs: fix extent buffer leak on failure to copy root
At btrfs_copy_root(), if the call to btrfs_inc_ref() fails we end up
returning without unlocking and releasing our reference on the extent
buffer named "cow" we previously allocated with btrfs_alloc_tree_block().
So fix that by unlocking the extent buffer and dropping our reference on
it before returning.
Fixes:
|
|
|
|
2c4d8cb737 |
btrfs: explain page locking and readahead in read_extent_buffer_pages()
In read_extent_buffer_pages(), if we failed to lock the page atomically, we just exit with return value 0. This is counter-intuitive, as normally if we can't lock what we need, we would return something like EAGAIN. But that return hides under (wait == WAIT_NONE) branch, which only gets triggered for readahead. And for readahead, if we failed to lock the page, it means the extent buffer is either being read by other thread, or has been read and is under modification. Either way the eb will or has been cached, thus readahead has no need to wait for it. Add comment on this counter-intuitive behavior. Reported-by: Dan Carpenter <dan.carpenter@oracle.com> Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
|
|
|
0bb3eb3ee8 |
btrfs: allow read-only mount of 4K sector size fs on 64K page system
This adds the basic RO mount ability for 4K sector size on 64K page system. Currently we only plan to support 4K and 64K page system. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
|
|
|
92082d4097 |
btrfs: integrate page status update for data read path into begin/end_page_read
In btrfs data page read path, the page status update are handled in two
different locations:
btrfs_do_read_page()
{
while (cur <= end) {
/* No need to read from disk */
if (HOLE/PREALLOC/INLINE){
memset();
set_extent_uptodate();
continue;
}
/* Read from disk */
ret = submit_extent_page(end_bio_extent_readpage);
}
end_bio_extent_readpage()
{
endio_readpage_uptodate_page_status();
}
This is fine for sectorsize == PAGE_SIZE case, as for above loop we
should only hit one branch and then exit.
But for subpage, there is more work to be done in page status update:
- Page Unlock condition
Unlike regular page size == sectorsize case, we can no longer just
unlock a page.
Only the last reader of the page can unlock the page.
This means, we can unlock the page either in the while() loop, or in
the endio function.
- Page uptodate condition
Since we have multiple sectors to read for a page, we can only mark
the full page uptodate if all sectors are uptodate.
To handle both subpage and regular cases, introduce a pair of functions
to help handling page status update:
- begin_page_read()
For regular case, it does nothing.
For subpage case, it updates the reader counters so that later
end_page_read() can know who is the last one to unlock the page.
- end_page_read()
This is just endio_readpage_uptodate_page_status() renamed.
The original name is a little too long and too specific for endio.
The new thing added is the condition for page unlock.
Now for subpage data, we unlock the page if we're the last reader.
This does not only provide the basis for subpage data read, but also
hide the special handling of page read from the main read loop.
Also, since we're changing how the page lock is handled, there are two
existing error paths where we need to manually unlock the page before
calling begin_page_read().
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
|
|
32443de338 |
btrfs: introduce btrfs_subpage for data inodes
To support subpage sector size, data also need extra info to make sure which sectors in a page are uptodate/dirty/... This patch will make pages for data inodes get btrfs_subpage structure attached, and detached when the page is freed. This patch also slightly changes the timing when set_page_extent_mapped() is called to make sure: - We have page->mapping set page->mapping->host is used to grab btrfs_fs_info, thus we can only call this function after page is mapped to an inode. One call site attaches pages to inode manually, thus we have to modify the timing of set_page_extent_mapped() a bit. - As soon as possible, before other operations Since memory allocation can fail, we have to do extra error handling. Calling set_page_extent_mapped() as soon as possible can simply the error handling for several call sites. The idea is pretty much the same as iomap_page, but with more bitmaps for btrfs specific cases. Currently the plan is to switch iomap if iomap can provide sector aligned write back (only write back dirty sectors, but not the full page, data balance require this feature). So we will stick to btrfs specific bitmap for now. Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
|
|
|
371cdc0700 |
btrfs: introduce subpage metadata validation check
For subpage metadata validation check, there are some differences: - Read must finish in one bvec Since we're just reading one subpage range in one page, it should never be split into two bios nor two bvecs. - How to grab the existing eb Instead of grabbing eb using page->private, we have to go search radix tree as we don't have any direct pointer at hand. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
|
|
|
4325cb2293 |
btrfs: support subpage in endio_readpage_update_page_status()
To handle subpage status update, add the following: - Use btrfs_page_*() subpage-aware helpers to update page status Now we can handle both cases well. - No page unlock for subpage metadata Since subpage metadata doesn't utilize page locking at all, skip it. For subpage data locking, it's handled in later commits. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
|
|
|
4012daf769 |
btrfs: introduce read_extent_buffer_subpage()
Introduce a helper, read_extent_buffer_subpage(), to do the subpage extent buffer read. The difference between regular and subpage routines are: - No page locking Here we completely rely on extent locking. Page locking can reduce the concurrency greatly, as if we lock one page to read one extent buffer, all the other extent buffers in the same page will have to wait. - Extent uptodate condition Despite the existing PageUptodate() and EXTENT_BUFFER_UPTODATE check, We also need to check btrfs_subpage::uptodate_bitmap. - No page iteration Just one page, no need to loop, this greatly simplified the subpage routine. This patch only implements the bio submit part, no endio support yet. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
|
|
|
d1e86e3fc3 |
btrfs: support subpage in try_release_extent_buffer()
Unlike the original try_release_extent_buffer(), try_release_subpage_extent_buffer() will iterate through all the ebs in the page, and try to release each. We can release the full page only after there's no private attached, which means all ebs of that page have been released as well. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
|
|
|
92d83e9436 |
btrfs: support subpage in btrfs_clone_extent_buffer
For btrfs_clone_extent_buffer(), it's mostly the same code of __alloc_dummy_extent_buffer(), except it has extra page copy. So to make it subpage compatible, we only need to: - Call set_extent_buffer_uptodate() instead of SetPageUptodate() This will set correct uptodate bit for subpage and regular sector size cases. Since we're calling set_extent_buffer_uptodate() which will also set EXTENT_BUFFER_UPTODATE bit, we don't need to manually set that bit either. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
|
|
|
251f2acc71 |
btrfs: support subpage in set/clear_extent_buffer_uptodate()
To support subpage in set_extent_buffer_uptodate and clear_extent_buffer_uptodate we only need to use the subpage-aware helpers to update the page bits. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
|
|
|
03a816b32b |
btrfs: introduce helpers for subpage error status
Introduce the following functions to handle subpage error status: - btrfs_subpage_set_error() - btrfs_subpage_clear_error() - btrfs_subpage_test_error() These helpers can only be called when the page has subpage attached and the range is ensured to be inside the page. - btrfs_page_set_error() - btrfs_page_clear_error() - btrfs_page_test_error() These helpers can handle both regular sector size and subpage without problem. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
|
|
|
a1d767c11c |
btrfs: introduce helpers for subpage uptodate status
Introduce the following functions to handle subpage uptodate status: - btrfs_subpage_set_uptodate() - btrfs_subpage_clear_uptodate() - btrfs_subpage_test_uptodate() These helpers can only be called when the page has subpage attached and the range is ensured to be inside the page. - btrfs_page_set_uptodate() - btrfs_page_clear_uptodate() - btrfs_page_test_uptodate() These helpers can handle both regular sector size and subpage. Although caller should still ensure that the range is inside the page. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
|
|
|
09bc1f0fb8 |
btrfs: attach private to dummy extent buffer pages
There are locations where we allocate dummy extent buffers for temporary usage, like in tree_mod_log_rewind() or get_old_root(). These dummy extent buffers will be handled by the same eb accessors, and if they don't have page::private subpage eb accessors could fail. To address such problems, make __alloc_dummy_extent_buffer() attach page private for dummy extent buffers too. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
|
|
|
8ff8466d29 |
btrfs: support subpage for extent buffer page release
In btrfs_release_extent_buffer_pages(), we need to add extra handling for subpage. Introduce a helper, detach_extent_buffer_page(), to do different handling for regular and subpage cases. For subpage case, handle detaching page private. For unmapped (dummy or cloned) ebs, we can detach the page private immediately as the page can only be attached to one unmapped eb. For mapped ebs, we have to ensure there are no eb in the page range before we delete it, as page->private is shared between all ebs in the same page. But there is a subpage specific race, where we can race with extent buffer allocation, and clear the page private while new eb is still being utilized, like this: Extent buffer A is the new extent buffer which will be allocated, while extent buffer B is the last existing extent buffer of the page. T1 (eb A) | T2 (eb B) -------------------------------+------------------------------ alloc_extent_buffer() | btrfs_release_extent_buffer_pages() |- p = find_or_create_page() | | |- attach_extent_buffer_page() | | | | |- detach_extent_buffer_page() | | |- if (!page_range_has_eb()) | | | No new eb in the page range yet | | | As new eb A hasn't yet been | | | inserted into radix tree. | | |- btrfs_detach_subpage() | | |- detach_page_private(); |- radix_tree_insert() | Then we have a metadata eb whose page has no private bit. To avoid such race, we introduce a subpage metadata-specific member, btrfs_subpage::eb_refs. In alloc_extent_buffer() we increase eb_refs in the critical section of private_lock. Then page_range_has_eb() will return true for detach_extent_buffer_page(), and will not detach page private. The section is marked by: - btrfs_page_inc_eb_refs() - btrfs_page_dec_eb_refs() Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
|
|
|
819822107d |
btrfs: make grab_extent_buffer_from_page() handle subpage case
For subpage case, grab_extent_buffer() can't really get an extent buffer just from btrfs_subpage. We have radix tree lock protecting us from inserting the same eb into the tree. Thus we don't really need to do the extra hassle, just let alloc_extent_buffer() handle the existing eb in radix tree. Now if two ebs are being allocated as the same time, one will fail with -EEIXST when inserting into the radix tree. So for grab_extent_buffer(), just always return NULL for subpage case. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
|
|
|
760f991f14 |
btrfs: make attach_extent_buffer_page() handle subpage case
For subpage case, we need to allocate additional memory for each metadata page. So we need to: - Allow attach_extent_buffer_page() to return int to indicate allocation failure - Allow manually pre-allocate subpage memory for alloc_extent_buffer() As we don't want to use GFP_ATOMIC under spinlock, we introduce btrfs_alloc_subpage() and btrfs_free_subpage() functions for this purpose. (The simple wrap for btrfs_free_subpage() is for later convert to kmem_cache. Already internally tested without problem) - Preallocate btrfs_subpage structure for alloc_extent_buffer() We don't want to call memory allocation with spinlock held, so do preallocation before we acquire mapping->private_lock. - Handle subpage and regular case differently in attach_extent_buffer_page() For regular case, no change, just do the usual thing. For subpage case, allocate new memory or use the preallocated memory. For future subpage metadata, we will make use of radix tree to grab extent buffer. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
|
|
|
cac06d843f |
btrfs: introduce the skeleton of btrfs_subpage structure
For sectorsize < page size support, we need a structure to record extra status info for each sector of a page. Introduce the skeleton structure, all subpage related code would go to subpage.[ch]. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
|
|
|
62c053fbb2 |
btrfs: set UNMAPPED bit early in btrfs_clone_extent_buffer() for subpage support
For the incoming subpage support, UNMAPPED extent buffer will have different behavior in btrfs_release_extent_buffer(). This means we need to set UNMAPPED bit early before calling btrfs_release_extent_buffer(). Currently there is only one caller which relies on btrfs_release_extent_buffer() in its error path while set UNMAPPED bit late: - btrfs_clone_extent_buffer() Make it subpage compatible by setting the UNMAPPED bit early, since we're here, also move the UPTODATE bit early. There is another caller, __alloc_dummy_extent_buffer(), setting UNMAPPED bit late, but that function clean up the allocated page manually, thus no need for any modification. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
|
|
|
6869b0a8be |
btrfs: merge PAGE_CLEAR_DIRTY and PAGE_SET_WRITEBACK to PAGE_START_WRITEBACK
PAGE_CLEAR_DIRTY and PAGE_SET_WRITEBACK are two defines used in __process_pages_contig(), to let the function know to clear page dirty bit and then set page writeback. However page writeback and dirty bits are conflicting (at least for sector size == PAGE_SIZE case), this means these two have to be always updated together. This means we can merge PAGE_CLEAR_DIRTY and PAGE_SET_WRITEBACK to PAGE_START_WRITEBACK. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
|
|
|
d0c2f4fa55 |
btrfs: make concurrent fsyncs wait less when waiting for a transaction commit
Often an fsync needs to fallback to a transaction commit for several reasons (to ensure consistency after a power failure, a new block group was allocated or a temporary error such as ENOMEM or ENOSPC happened). In that case the log is marked as needing a full commit and any concurrent tasks attempting to log inodes or commit the log will also fallback to the transaction commit. When this happens they all wait for the task that first started the transaction commit to finish the transaction commit - however they wait until the full transaction commit happens, which is not needed, as they only need to wait for the superblocks to be persisted and not for unpinning all the extents pinned during the transaction's lifetime, which even for short lived transactions can be a few thousand and take some significant amount of time to complete - for dbench workloads I have observed up to 4~5 milliseconds of time spent unpinning extents in the worst cases, and the number of pinned extents was between 2 to 3 thousand. So allow fsync tasks to skip waiting for the unpinning of extents when they call btrfs_commit_transaction() and they were not the task that started the transaction commit (that one has to do it, the alternative would be to offload the transaction commit to another task so that it could avoid waiting for the extent unpinning or offload the extent unpinning to another task). This patch is part of a patchset comprised of the following patches: btrfs: remove unnecessary directory inode item update when deleting dir entry btrfs: stop setting nbytes when filling inode item for logging btrfs: avoid logging new ancestor inodes when logging new inode btrfs: skip logging directories already logged when logging all parents btrfs: skip logging inodes already logged when logging new entries btrfs: remove unnecessary check_parent_dirs_for_sync() btrfs: make concurrent fsyncs wait less when waiting for a transaction commit After applying the entire patchset, dbench shows improvements in respect to throughput and latency. The script used to measure it is the following: $ cat dbench-test.sh #!/bin/bash DEV=/dev/sdk MNT=/mnt/sdk MOUNT_OPTIONS="-o ssd" MKFS_OPTIONS="-m single -d single" echo "performance" | tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor umount $DEV &> /dev/null mkfs.btrfs -f $MKFS_OPTIONS $DEV mount $MOUNT_OPTIONS $DEV $MNT dbench -D $MNT -t 300 64 umount $MNT The test was run on a physical machine with 12 cores (Intel corei7), 64G of ram, using a NVMe device and a non-debug kernel configuration (Debian's default configuration). Before applying patchset, 32 clients: Operation Count AvgLat MaxLat ---------------------------------------- NTCreateX |
|
|
|
64d6b281ba |
btrfs: remove unnecessary check_parent_dirs_for_sync()
Whenever we fsync an inode, if it is a directory, a regular file that was created in the current transaction or has last_unlink_trans set to the generation of the current transaction, we check if any of its ancestor inodes (and the inode itself if it is a directory) can not be logged and need a fallback to a full transaction commit - if so, we return with a value of 1 in order to fallback to a transaction commit. However we often do not need to fallback to a transaction commit because: 1) The ancestor inode is not an immediate parent, and therefore there is not an explicit request to log it and it is not needed neither to guarantee the consistency of the inode originally asked to be logged (fsynced) nor its immediate parent; 2) The ancestor inode was already logged before, in which case any link, unlink or rename operation updates the log as needed. So for these two cases we can avoid an unnecessary transaction commit. Therefore remove check_parent_dirs_for_sync() and add a check at the top of btrfs_log_inode() to make us fallback immediately to a transaction commit when we are logging a directory inode that can not be logged and needs a full transaction commit. All we need to protect is the case where after renaming a file someone fsyncs only the old directory, which would result is losing the renamed file after a log replay. This patch is part of a patchset comprised of the following patches: btrfs: remove unnecessary directory inode item update when deleting dir entry btrfs: stop setting nbytes when filling inode item for logging btrfs: avoid logging new ancestor inodes when logging new inode btrfs: skip logging directories already logged when logging all parents btrfs: skip logging inodes already logged when logging new entries btrfs: remove unnecessary check_parent_dirs_for_sync() btrfs: make concurrent fsyncs wait less when waiting for a transaction commit Performance results, after applying all patches, are mentioned in the change log of the last patch. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
|
|
|
0e44cb3f94 |
btrfs: skip logging inodes already logged when logging new entries
When logging new directory entries of a directory, we log the inodes of new dentries and the inodes of dentries pointing to directories that may have been created in past transactions. For the case of directories we log in full mode, which can be particularly expensive for large directories. We do use btrfs_inode_in_log() to skip already logged inodes, however for that helper to return true, it requires that the log transaction used to log the inode to be already committed. This means that when we have more than one task using the same log transaction we can end up logging an inode multiple times, which is a waste of time and not necessary since the log will be committed by one of the tasks and the others will wait for the log transaction to be committed before returning to user space. So simply replace the use of btrfs_inode_in_log() with the new helper function need_log_inode(), introduced in a previous commit. This patch is part of a patchset comprised of the following patches: btrfs: remove unnecessary directory inode item update when deleting dir entry btrfs: stop setting nbytes when filling inode item for logging btrfs: avoid logging new ancestor inodes when logging new inode btrfs: skip logging directories already logged when logging all parents btrfs: skip logging inodes already logged when logging new entries btrfs: remove unnecessary check_parent_dirs_for_sync() btrfs: make concurrent fsyncs wait less when waiting for a transaction commit Performance results, after applying all patches, are mentioned in the change log of the last patch. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
|
|
|
3e6a86a193 |
btrfs: skip logging directories already logged when logging all parents
Some times when we fsync an inode we need to do a full log of all its ancestors (due to unlink, link or rename operations), which can be an expensive operation, specially if the directories are large. However if we find an ancestor directory inode that is already logged in the current transaction, and has no inserted/updated/deleted xattrs since it was last logged, we can skip logging the directory again. We are safe to skip that since we know that for logged directories, any link, unlink or rename operations that implicate the directory will update the log as necessary. So use the helper need_log_dir(), introduced in a previous commit, to detect already logged directories that can be skipped. This patch is part of a patchset comprised of the following patches: btrfs: remove unnecessary directory inode item update when deleting dir entry btrfs: stop setting nbytes when filling inode item for logging btrfs: avoid logging new ancestor inodes when logging new inode btrfs: skip logging directories already logged when logging all parents btrfs: skip logging inodes already logged when logging new entries btrfs: remove unnecessary check_parent_dirs_for_sync() btrfs: make concurrent fsyncs wait less when waiting for a transaction commit Performance results, after applying all patches, are mentioned in the change log of the last patch. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
|
|
|
ab12313a9f |
btrfs: avoid logging new ancestor inodes when logging new inode
When we fsync a new file, created in the current transaction, we check all its ancestor inodes and always log them if they were created in the current transaction - even if we have already logged them before, which is a waste of time. So avoid logging new ancestor inodes if they were already logged before and have no xattrs added/updated/removed since they were last logged. This patch is part of a patchset comprised of the following patches: btrfs: remove unnecessary directory inode item update when deleting dir entry btrfs: stop setting nbytes when filling inode item for logging btrfs: avoid logging new ancestor inodes when logging new inode btrfs: skip logging directories already logged when logging all parents btrfs: skip logging inodes already logged when logging new entries btrfs: remove unnecessary check_parent_dirs_for_sync() btrfs: make concurrent fsyncs wait less when waiting for a transaction commit Performance results, after applying all patches, are mentioned in the change log of the last patch. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
|
|
|
e593e54ed1 |
btrfs: stop setting nbytes when filling inode item for logging
When we fill an inode item for logging we are setting its nbytes field with the value returned by inode_get_bytes() (a VFS API), however we do not need it because it is not used during log replay. In fact, for fast fsyncs, when we call inode_get_bytes() we may even get an outdated value for nbytes because the nbytes field of the inode is only updated when ordered extents complete, and a fast fsync only waits for writeback to complete, it does not wait for ordered extent completion. So just remove the setup of nbytes and add an explicit comment mentioning why we do not set it. This also avoids adding contention on the inode's i_lock (VFS) with concurrent stat() calls, since that spinlock is used by inode_get_bytes() which is also called by our stat callback (btrfs_getattr()). This patch is part of a patchset comprised of the following patches: btrfs: remove unnecessary directory inode item update when deleting dir entry btrfs: stop setting nbytes when filling inode item for logging btrfs: avoid logging new ancestor inodes when logging new inode btrfs: skip logging directories already logged when logging all parents btrfs: skip logging inodes already logged when logging new entries btrfs: remove unnecessary check_parent_dirs_for_sync() btrfs: make concurrent fsyncs wait less when waiting for a transaction commit Performance results, after applying all patches, are mentioned in the change log of the last patch. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
|
|
|
ddffcf6fb5 |
btrfs: remove unnecessary directory inode item update when deleting dir entry
When we remove a directory entry, as part of an unlink operation, if the
directory was logged before we must remove the directory index items from
the log. We are also updating the inode item of the directory to update
its i_size, but that is not necessary because during log replay we do not
need it and we correctly adjust the i_size in the inode item of the
subvolume as we process directory index items and replay deletes.
This is not needed since commit
|
|
|
|
4203431319 |
btrfs: let callers of btrfs_get_io_geometry pass the em
Before this change, the btrfs_get_io_geometry() function was calling btrfs_get_chunk_map() to get the extent mapping, necessary for calculating the I/O geometry. It was using that extent mapping only internally and freeing the pointer after its execution. That resulted in calling btrfs_get_chunk_map() de facto twice by the __btrfs_map_block() function. It was calling btrfs_get_io_geometry() first and then calling btrfs_get_chunk_map() directly to get the extent mapping, used by the rest of the function. Change that to passing the extent mapping to the btrfs_get_io_geometry() function as an argument. This could improve performance in some cases. For very large filesystems, i.e. several thousands of allocated chunks, not only this avoids searching two times the rbtree, saving time, it may also help reducing contention on the lock that protects the tree - thinking of writeback starting for multiple inodes, other tasks allocating or removing chunks, and anything else that requires access to the rbtree. Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Michal Rostecki <mrostecki@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> [ add Filipe's analysis ] Signed-off-by: David Sterba <dsterba@suse.com> |
|
|
|
951c80f83d |
btrfs: fix double accounting of ordered extent for subpage case in btrfs_invalidapge
Commit |
|
|
|
a4559e6f6f |
btrfs: simplify condition in __btrfs_run_delayed_items
Fix the following coccicheck warnings: ./fs/btrfs/delayed-inode.c:1157:39-41: WARNING !A || A && B is equivalent to !A || B. Reported-by: Abaci Robot <abaci@linux.alibaba.com> Suggested-by: Jiapeng Zhong <oswb@linux.alibaba.com> Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Abaci Team <abaci-bugfix@linux.alibaba.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
|
|
|
2965194b77 |
btrfs: remove wrong comment for can_nocow_extent()
The comment for can_nocow_extent() says that the function will flush
ordered extents, however that never happens and was never true before the
comment was added in commit
|
|
|
|
e5ad49e215 |
btrfs: add a trace class for dumping the current ENOSPC state
Often when I'm debugging ENOSPC related issues I have to resort to printing the entire ENOSPC state with trace_printk() in different spots. This gets pretty annoying, so add a trace state that does this for us. Then add a trace point at the end of preemptive flushing so you can see the state of the space_info when we decide to exit preemptive flushing. This helped me figure out we weren't kicking in the preemptive flushing soon enough. Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
|
|
|
4b02b00fe5 |
btrfs: adjust the flush trace point to include the source
Since we have normal ticketed flushing and preemptive flushing, adjust the tracepoint so that we know the source of the flushing action to make it easier to debug problems. Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
|
|
|
88a777a6e5 |
btrfs: implement space clamping for preemptive flushing
Starting preemptive flushing at 50% of available free space is a good start, but some workloads are particularly abusive and can quickly overwhelm the preemptive flushing code and drive us into using tickets. Handle this by clamping down on our threshold for starting and continuing to run preemptive flushing. This is particularly important for our overcommit case, as we can really drive the file system into overages and then it's more difficult to pull it back as we start to actually fill up the file system. The clamping is essentially 2^CLAMP, but we start at 1 so whatever we calculate for overcommit is the baseline. Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: David Sterba <dsterba@suse.com> |
|
|
|
2e294c6049 |
btrfs: simplify the logic in need_preemptive_flushing
A lot of this was added all in one go with no explanation, and is a bit unwieldy and confusing. Simplify the logic to start preemptive flushing if we've reserved more than half of our available free space. Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: David Sterba <dsterba@suse.com> |
|
|
|
9f42d37748 |
btrfs: rework btrfs_calc_reclaim_metadata_size
Currently btrfs_calc_reclaim_metadata_size does two things, it returns the space currently required for flushing by the tickets, and if there are no tickets it calculates a value for the preemptive flushing. However for the normal ticketed flushing we really only care about the space required for tickets. We will accidentally come in and flush one time, but as soon as we see there are no tickets we bail out of our flushing. Fix this by making btrfs_calc_reclaim_metadata_size really only tell us what is required for flushing if we have people waiting on space. Then move the preemptive flushing logic into need_preemptive_reclaim(). We ignore btrfs_calc_reclaim_metadata_size() in need_preemptive_reclaim() because if we are in this path then we made our reservation and there are not pending tickets currently, so we do not need to check it, simply do the fuzzy logic to check if we're getting low on space. Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: David Sterba <dsterba@suse.com> |
|
|
|
f205edf773 |
btrfs: check reclaim_size in need_preemptive_reclaim
If we're flushing space for tickets then we have space_info->reclaim_size set and we do not need to do background reclaim. Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: David Sterba <dsterba@suse.com> |
|
|
|
ae7913ba52 |
btrfs: rename need_do_async_reclaim
All of our normal flushing is asynchronous reclaim, so this helper is poorly named. This is more checking if we need to preemptively flush space, so rename it to need_preemptive_reclaim. Also switch it to bool and make it plain static as followup patches will move more code here. Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
|
|
|
576fa34830 |
btrfs: improve preemptive background space flushing
Currently if we ever have to flush space because we do not have enough we allocate a ticket and attach it to the space_info, and then systematically flush things in the filesystem that hold space reservations until our space is reclaimed. However this has a latency cost, we must go to sleep and wait for the flushing to make progress before we are woken up and allowed to continue doing our work. In order to address that we used to kick off the async worker to flush space preemptively, so that we could be reclaiming space hopefully before any tasks needed to stop and wait for space to reclaim. When I introduced the ticketed ENOSPC stuff this broke slightly in the fact that we were using tickets to indicate if we were done flushing. No tickets, no more flushing. However this meant that we essentially never preemptively flushed. This caused a write performance regression that Nikolay noticed in an unrelated patch that removed the committing of the transaction during btrfs_end_transaction. The behavior that happened pre that patch was btrfs_end_transaction() would see that we were low on space, and it would commit the transaction. This was bad because in this particular case you could end up with thousands and thousands of transactions being committed during the 5 minute reproducer. With the patch to remove this behavior we got much more sane transaction commits, but we ended up slower because we would write for a while, flush, write for a while, flush again. To address this we need to reinstate a preemptive flushing mechanism. However it is distinctly different from our ticketing flushing in that it doesn't have tickets to base it's decisions on. Instead of bolting this logic into our existing flushing work, add another worker to handle this preemptive flushing. Here we will attempt to be slightly intelligent about the things that we flushing, attempting to balance between whichever pool is taking up the most space. Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: David Sterba <dsterba@suse.com> |
|
|
|
f00c42dd4c |
btrfs: introduce a FORCE_COMMIT_TRANS flush operation
Solely for preemptive flushing, we want to be able to force the transaction commit without any of the ambiguity of may_commit_transaction(). This is because may_commit_transaction() checks tickets and such, and in preemptive flushing we already know it'll be helpful, so use this to keep the code nice and clean and straightforward. Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> [ add comment ] Signed-off-by: David Sterba <dsterba@suse.com> |
|
|
|
5deb17e18e |
btrfs: track ordered bytes instead of just dio ordered bytes
We track dio_bytes because the shrink delalloc code needs to know if we have more DIO in flight than we have normal buffered IO. The reason for this is because we can't "flush" DIO, we have to just wait on the ordered extents to finish. However this is true of all ordered extents. If we have more ordered space outstanding than dirty pages we should be waiting on ordered extents. We already are ok on this front technically, because we always do a FLUSH_DELALLOC_WAIT loop, but I want to use the ordered counter in the preemptive flushing code as well, so change this to count all ordered bytes instead of just DIO ordered bytes. Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: David Sterba <dsterba@suse.com> |
|
|
|
ac1ea10e75 |
btrfs: add a trace point for reserve tickets
While debugging a ENOSPC related performance problem I needed to see the time difference between start and end of a reserve ticket, so add a trace point to report when we handle a reserve ticket. I opted to spit out start_ns itself without calculating the difference because there could be a gap between enabling the tracepoint and setting start_ns. Doing it this way allows us to filter on 0 start_ns so we don't get bogus entries, and we can easily calculate the time difference with bpftrace or something else. Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
|
|
|
91e79a83ff |
btrfs: make flush_space take a enum btrfs_flush_state instead of int
I got a automated message from somebody who runs clang against our kernels and it's because I used the wrong enum type for what I passed into flush_space, caught by -Wenum-conversion. Change the argument to be explicitly the enum we're expecting to make everything consistent. Maybe eventually gcc will catch errors like this. Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
|
|
|
8898038309 |
btrfs: send: use struct send_ctx *sctx for btrfs_compare_trees and changed_cb
btrfs_compare_trees and changed_cb use a void *ctx parameter instead of
struct send_ctx *sctx but when used in changed_cb it is immediately
cast to `struct send_ctx *sctx = ctx;`.
changed_cb is only ever called from btrfs_compare_trees and full_send_tree:
- full_send_tree already passes a struct send_ctx *sctx
- btrfs_compare_trees is only called by send_subvol with a struct send_ctx *sctx
- void *ctx in btrfs_compare_trees is only used to be passed to changed_cb
So casting to/from void *ctx seems unnecessary and directly using
struct send_ctx *sctx instead provides better type-safety.
The original reason for using void *ctx in the first place seems to have
been dropped with
|
|
|
|
488bc2a2d2 |
btrfs: run delayed refs less often in commit_cowonly_roots
We love running delayed refs in commit_cowonly_roots, but it is a bit excessive. I was seeing cases of running 3 or 4 refs a few times in a row during this time. Instead simply: - update all of the roots first - then run delayed refs - then handle the empty block groups case - and then if we have any more dirty roots do the whole thing again This allows us to be much more efficient with our delayed ref running, as we can batch a few more operations at once. Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: David Sterba <dsterba@suse.com> |
|
|
|
dac348e925 |
btrfs: stop running all delayed refs during snapshot
This was added in commit
|
|
|
|
b7774425e0 |
btrfs: remove bogus BUG_ON in alloc_reserved_tree_block
The fix |
|
|
|
2a4d84c11a |
btrfs: move delayed ref flushing for qgroup into qgroup helper
The commit
|
|
|
|
ad368f3394 |
btrfs: only run delayed refs once before committing
We try to pre-flush the delayed refs when committing, because we want to do as little work as possible in the critical section of the transaction commit. However doing this twice can lead to very long transaction commit delays as other threads are allowed to continue to generate more delayed refs, which potentially delays the commit by multiple minutes in very extreme cases. So simply stick to one pre-flush, and then continue the rest of the transaction commit. Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: David Sterba <dsterba@suse.com> |
|
|
|
61a56a992f |
btrfs: delayed refs pre-flushing should only run the heads we have
Previously our delayed ref running used the total number of items as the items to run. However we changed that to number of heads to run with the delayed_refs_rsv, as generally we want to run all of the operations for one bytenr. But with btrfs_run_delayed_refs(trans, 0) we set our count to 2x the number of items that we have. This is generally fine, but if we have some operation generation loads of delayed refs while we're doing this pre-flushing in the transaction commit, we'll just spin forever doing delayed refs. Fix this to simply pick the number of delayed refs we currently have, that way we do not end up doing a lot of extra work that's being generated in other threads. Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: David Sterba <dsterba@suse.com> |
|
|
|
e19eb11f4f |
btrfs: only let one thread pre-flush delayed refs in commit
I've been running a stress test that runs 20 workers in their own subvolume, which are running an fsstress instance with 4 threads per worker, which is 80 total fsstress threads. In addition to this I'm running balance in the background as well as creating and deleting snapshots. This test takes around 12 hours to run normally, going slower and slower as the test goes on. The reason for this is because fsstress is running fsync sometimes, and because we're messing with block groups we often fall through to btrfs_commit_transaction, so will often have 20-30 threads all calling btrfs_commit_transaction at the same time. These all get stuck contending on the extent tree while they try to run delayed refs during the initial part of the commit. This is suboptimal, really because the extent tree is a single point of failure we only want one thread acting on that tree at once to reduce lock contention. Fix this by making the flushing mechanism a bit operation, to make it easy to use test_and_set_bit() in order to make sure only one task does this initial flush. Once we're into the transaction commit we only have one thread doing delayed ref running, it's just this initial pre-flush that is problematic. With this patch my stress test takes around 90 minutes to run, instead of 12 hours. The memory barrier is not necessary for the flushing bit as it's ordered, unlike plain int. The transaction state accessed in btrfs_should_end_transaction could be affected by that too as it's not always used under transaction lock. Upon Nikolay's analysis in [1] it's not necessary: In should_end_transaction it's read without holding any locks. (U) It's modified in btrfs_cleanup_transaction without holding the fs_info->trans_lock (U), but the STATE_ERROR flag is going to be set. set in cleanup_transaction under fs_info->trans_lock (L) set in btrfs_commit_trans to COMMIT_START under fs_info->trans_lock.(L) set in btrfs_commit_trans to COMMIT_DOING under fs_info->trans_lock.(L) set in btrfs_commit_trans to COMMIT_UNBLOCK under fs_info->trans_lock.(L) set in btrfs_commit_trans to COMMIT_COMPLETED without locks but at this point the transaction is finished and fs_info->running_trans is NULL (U but irrelevant). So by the looks of it we can have a concurrent READ race with a WRITE, due to reads not taking a lock. In this case what we want to ensure is we either see new or old state. I consulted with Will Deacon and he said that in such a case we'd want to annotate the accesses to ->state with (READ|WRITE)_ONCE so as to avoid a theoretical tear, in this case I don't think this could happen but I imagine at some point KCSAN would flag such an access as racy (which it is). [1] https://lore.kernel.org/linux-btrfs/e1fd5cc1-0f28-f670-69f4-e9958b4964e6@suse.com Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> [ add comments regarding memory barrier ] Signed-off-by: David Sterba <dsterba@suse.com> |
|
|
|
ddfd08cb04 |
btrfs: do not block on deleted bgs mutex in the cleaner
While running some stress tests I started getting hung task messages. This is because the delete unused block groups code has to take the delete_unused_bgs_mutex to do it's work, which is taken by balance to make sure we don't delete block groups while we're balancing. The problem is that balance can take a while, and so we were getting hung task warnings. We don't need to block and run these things, and the cleaner is needed to do other work, so trylock on this mutex and just bail if we can't acquire it right away. Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
|
|
|
867ed321f9 |
btrfs: abort the transaction if we fail to inc ref in btrfs_copy_root
While testing my error handling patches, I added a error injection site
at btrfs_inc_extent_ref, to validate the error handling I added was
doing the correct thing. However I hit a pretty ugly corruption while
doing this check, with the following error injection stack trace:
btrfs_inc_extent_ref
btrfs_copy_root
create_reloc_root
btrfs_init_reloc_root
btrfs_record_root_in_trans
btrfs_start_transaction
btrfs_update_inode
btrfs_update_time
touch_atime
file_accessed
btrfs_file_mmap
This is because we do not catch the error from btrfs_inc_extent_ref,
which in practice would be ENOMEM, which means we lose the extent
references for a root that has already been allocated and inserted,
which is the problem. Fix this by aborting the transaction if we fail
to do the reference modification.
CC: stable@vger.kernel.org # 4.4+
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
|
|
eddda68d97 |
btrfs: add asserts for deleting backref cache nodes
A weird KASAN problem that Zygo reported could have been easily caught if we checked for basic things in our backref freeing code. We have two methods of freeing a backref node - btrfs_backref_free_node: this just is kfree() essentially. - btrfs_backref_drop_node: this actually unlinks the node and cleans up everything and then calls btrfs_backref_free_node(). We should mostly be using btrfs_backref_drop_node(), to make sure the node is properly unlinked from the backref cache, and only use btrfs_backref_free_node() when we know the node isn't actually linked to the backref cache. We made a mistake here and thus got the KASAN splat. Make this style of issue easier to find by adding some ASSERT()'s to btrfs_backref_free_node() and adjusting our deletion stuff to properly init the list so we can rely on list_empty() checks working properly. BUG: KASAN: use-after-free in btrfs_backref_cleanup_node+0x18a/0x420 Read of size 8 at addr ffff888112402950 by task btrfs/28836 CPU: 0 PID: 28836 Comm: btrfs Tainted: G W 5.10.0-e35f27394290-for-next+ #23 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.12.0-1 04/01/2014 Call Trace: dump_stack+0xbc/0xf9 ? btrfs_backref_cleanup_node+0x18a/0x420 print_address_description.constprop.8+0x21/0x210 ? record_print_text.cold.34+0x11/0x11 ? btrfs_backref_cleanup_node+0x18a/0x420 ? btrfs_backref_cleanup_node+0x18a/0x420 kasan_report.cold.10+0x20/0x37 ? btrfs_backref_cleanup_node+0x18a/0x420 __asan_load8+0x69/0x90 btrfs_backref_cleanup_node+0x18a/0x420 btrfs_backref_release_cache+0x83/0x1b0 relocate_block_group+0x394/0x780 ? merge_reloc_roots+0x4a0/0x4a0 btrfs_relocate_block_group+0x26e/0x4c0 btrfs_relocate_chunk+0x52/0x120 btrfs_balance+0xe2e/0x1900 ? check_flags.part.50+0x6c/0x1e0 ? btrfs_relocate_chunk+0x120/0x120 ? kmem_cache_alloc_trace+0xa06/0xcb0 ? _copy_from_user+0x83/0xc0 btrfs_ioctl_balance+0x3a7/0x460 btrfs_ioctl+0x24c8/0x4360 ? __kasan_check_read+0x11/0x20 ? check_chain_key+0x1f4/0x2f0 ? __asan_loadN+0xf/0x20 ? btrfs_ioctl_get_supported_features+0x30/0x30 ? kvm_sched_clock_read+0x18/0x30 ? check_chain_key+0x1f4/0x2f0 ? lock_downgrade+0x3f0/0x3f0 ? handle_mm_fault+0xad6/0x2150 ? do_vfs_ioctl+0xfc/0x9d0 ? ioctl_file_clone+0xe0/0xe0 ? check_flags.part.50+0x6c/0x1e0 ? check_flags.part.50+0x6c/0x1e0 ? check_flags+0x26/0x30 ? lock_is_held_type+0xc3/0xf0 ? syscall_enter_from_user_mode+0x1b/0x60 ? do_syscall_64+0x13/0x80 ? rcu_read_lock_sched_held+0xa1/0xd0 ? __kasan_check_read+0x11/0x20 ? __fget_light+0xae/0x110 __x64_sys_ioctl+0xc3/0x100 do_syscall_64+0x37/0x80 entry_SYSCALL_64_after_hwframe+0x44/0xa9 RIP: 0033:0x7f4c4bdfe427 RSP: 002b:00007fff33ee6df8 EFLAGS: 00000202 ORIG_RAX: 0000000000000010 RAX: ffffffffffffffda RBX: 00007fff33ee6e98 RCX: 00007f4c4bdfe427 RDX: 00007fff33ee6e98 RSI: 00000000c4009420 RDI: 0000000000000003 RBP: 0000000000000003 R08: 0000000000000003 R09: 0000000000000078 R10: fffffffffffff59d R11: 0000000000000202 R12: 0000000000000001 R13: 0000000000000000 R14: 00007fff33ee8a34 R15: 0000000000000001 Allocated by task 28836: kasan_save_stack+0x21/0x50 __kasan_kmalloc.constprop.18+0xbe/0xd0 kasan_kmalloc+0x9/0x10 kmem_cache_alloc_trace+0x410/0xcb0 btrfs_backref_alloc_node+0x46/0xf0 btrfs_backref_add_tree_node+0x60d/0x11d0 build_backref_tree+0xc5/0x700 relocate_tree_blocks+0x2be/0xb90 relocate_block_group+0x2eb/0x780 btrfs_relocate_block_group+0x26e/0x4c0 btrfs_relocate_chunk+0x52/0x120 btrfs_balance+0xe2e/0x1900 btrfs_ioctl_balance+0x3a7/0x460 btrfs_ioctl+0x24c8/0x4360 __x64_sys_ioctl+0xc3/0x100 do_syscall_64+0x37/0x80 entry_SYSCALL_64_after_hwframe+0x44/0xa9 Freed by task 28836: kasan_save_stack+0x21/0x50 kasan_set_track+0x20/0x30 kasan_set_free_info+0x1f/0x30 __kasan_slab_free+0xf3/0x140 kasan_slab_free+0xe/0x10 kfree+0xde/0x200 btrfs_backref_error_cleanup+0x452/0x530 build_backref_tree+0x1a5/0x700 relocate_tree_blocks+0x2be/0xb90 relocate_block_group+0x2eb/0x780 btrfs_relocate_block_group+0x26e/0x4c0 btrfs_relocate_chunk+0x52/0x120 btrfs_balance+0xe2e/0x1900 btrfs_ioctl_balance+0x3a7/0x460 btrfs_ioctl+0x24c8/0x4360 __x64_sys_ioctl+0xc3/0x100 do_syscall_64+0x37/0x80 entry_SYSCALL_64_after_hwframe+0x44/0xa9 The buggy address belongs to the object at ffff888112402900 which belongs to the cache kmalloc-128 of size 128 The buggy address is located 80 bytes inside of 128-byte region [ffff888112402900, ffff888112402980) The buggy address belongs to the page: page:0000000028b1cd08 refcount:1 mapcount:0 mapping:0000000000000000 index:0xffff888131c810c0 pfn:0x112402 flags: 0x17ffe0000000200(slab) raw: 017ffe0000000200 ffffea000424f308 ffffea0007d572c8 ffff888100040440 raw: ffff888131c810c0 ffff888112402000 0000000100000009 0000000000000000 page dumped because: kasan: bad access detected Memory state around the buggy address: ffff888112402800: fa fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb ffff888112402880: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc >ffff888112402900: fa fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb ^ ffff888112402980: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc ffff888112402a00: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb Link: https://lore.kernel.org/linux-btrfs/20201208194607.GI31381@hungrycats.org/ CC: stable@vger.kernel.org # 5.10+ Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
|
|
|
f78743fbda |
btrfs: do not warn if we can't find the reloc root when looking up backref
The backref code is looking for a reloc_root that corresponds to the given fs root. However any number of things could have gone wrong while initializing that reloc_root, like ENOMEM while trying to allocate the root itself, or EIO while trying to write the root item. This would result in no corresponding reloc_root being in the reloc root cache, and thus would return NULL when we do the find_reloc_root() call. Because of this we do not want to WARN_ON(). This presumably was meant to catch developer errors, cases where we messed up adding the reloc root. However we can easily hit this case with error injection, and thus should not do a WARN_ON(). CC: stable@vger.kernel.org # 5.10+ Reported-by: Zygo Blaxell <ce3g8jdj@umail.furryterror.org> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
|
|
|
938fcbfb0c |
btrfs: splice remaining dirty_bg's onto the transaction dirty bg list
While doing error injection testing with my relocation patches I hit the following assert: assertion failed: list_empty(&block_group->dirty_list), in fs/btrfs/block-group.c:3356 ------------[ cut here ]------------ kernel BUG at fs/btrfs/ctree.h:3357! invalid opcode: 0000 [#1] SMP NOPTI CPU: 0 PID: 24351 Comm: umount Tainted: G W 5.10.0-rc3+ #193 Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.13.0-2.fc32 04/01/2014 RIP: 0010:assertfail.constprop.0+0x18/0x1a RSP: 0018:ffffa09b019c7e00 EFLAGS: 00010282 RAX: 0000000000000056 RBX: ffff8f6492c18000 RCX: 0000000000000000 RDX: ffff8f64fbc27c60 RSI: ffff8f64fbc19050 RDI: ffff8f64fbc19050 RBP: ffff8f6483bbdc00 R08: 0000000000000000 R09: 0000000000000000 R10: ffffa09b019c7c38 R11: ffffffff85d70928 R12: ffff8f6492c18100 R13: ffff8f6492c18148 R14: ffff8f6483bbdd70 R15: dead000000000100 FS: 00007fbfda4cdc40(0000) GS:ffff8f64fbc00000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 00007fbfda666fd0 CR3: 000000013cf66002 CR4: 0000000000370ef0 Call Trace: btrfs_free_block_groups.cold+0x55/0x55 close_ctree+0x2c5/0x306 ? fsnotify_destroy_marks+0x14/0x100 generic_shutdown_super+0x6c/0x100 kill_anon_super+0x14/0x30 btrfs_kill_super+0x12/0x20 deactivate_locked_super+0x36/0xa0 cleanup_mnt+0x12d/0x190 task_work_run+0x5c/0xa0 exit_to_user_mode_prepare+0x1b1/0x1d0 syscall_exit_to_user_mode+0x54/0x280 entry_SYSCALL_64_after_hwframe+0x44/0xa9 This happened because I injected an error in btrfs_cow_block() while running the dirty block groups. When we run the dirty block groups, we splice the list onto a local list to process. However if an error occurs, we only cleanup the transactions dirty block group list, not any pending block groups we have on our locally spliced list. In fact if we fail to allocate a path in this function we'll also fail to clean up the splice list. Fix this by splicing the list back onto the transaction dirty block group list so that the block groups are cleaned up. Then add a 'out' label and have the error conditions jump to out so that the errors are handled properly. This also has the side-effect of fixing a problem where we would clear 'ret' on error because we unconditionally ran btrfs_run_delayed_refs(). CC: stable@vger.kernel.org # 4.4+ Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
|
|
|
c78a10aebb |
btrfs: fix reloc root leak with 0 ref reloc roots on recovery
When recovering a relocation, if we run into a reloc root that has 0 refs we simply add it to the reloc_control->reloc_roots list, and then clean it up later. The problem with this is __del_reloc_root() doesn't do anything if the root isn't in the radix tree, which in this case it won't be because we never call __add_reloc_root() on the reloc_root. This exit condition simply isn't correct really. During normal operation we can remove ourselves from the rb tree and then we're meant to clean up later at merge_reloc_roots() time, and this happens correctly. During recovery we're depending on free_reloc_roots() to drop our references, but we're short-circuiting. Fix this by continuing to check if we're on the list and dropping ourselves from the reloc_control root list and dropping our reference appropriately. Change the corresponding BUG_ON() to an ASSERT() that does the correct thing if we aren't in the rb tree. CC: stable@vger.kernel.org # 4.4+ Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
|
|
|
2e626e5673 |
btrfs: remove repeated word in struct member comment
Comment for processed extent end of range has an unnecessary "in", remove it. Signed-off-by: Nigel Christian <nigel.l.christian@gmail.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
|
|
|
81e75ac74e |
btrfs: account for new extents being deleted in total_bytes_pinned
My recent patch set "A variety of lock contention fixes", found here https://lore.kernel.org/linux-btrfs/cover.1608319304.git.josef@toxicpanda.com/ (Tracked in https://github.com/btrfs/linux/issues/86) that reduce lock contention on the extent root by running delayed refs less often resulted in a regression in generic/371. This test fallocate()'s the fs until it's full, deletes all the files, and then tries to fallocate() until full again. Before these patches we would run all of the delayed refs during flushing, and then would commit the transaction because we had plenty of pinned space to recover in order to allocate. However my patches made it so we weren't running the delayed refs as aggressively, which meant that we appeared to have less pinned space when we were deciding to commit the transaction. We use the space_info->total_bytes_pinned to approximate how much space we have pinned. It's approximate because if we remove a reference to an extent we may free it, but there may be more references to it than we know of at that point, but we account it as pinned at the creation time, and then it's properly accounted when the delayed ref runs. The way we account for pinned space is if the delayed_ref_head->total_ref_mod is < 0, because that is clearly a freeing option. However there is another case, and that is where ->total_ref_mod == 0 && ->must_insert_reserved == 1. When we allocate a new extent, we have ->total_ref_mod == 1 and we have ->must_insert_reserved == 1. This is used to indicate that it is a brand new extent and will need to have its extent entry added before we modify any references on the delayed ref head. But if we subsequently remove that extent reference, our ->total_ref_mod will be 0, and that space will be pinned and freed. Accounting for this case properly allows for generic/371 to pass with my delayed refs patches applied. It's important to note that this problem exists without the referenced patches, it just was uncovered by them. CC: stable@vger.kernel.org # 5.10 Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: David Sterba <dsterba@suse.com> |
|
|
|
2187374f35 |
btrfs: handle space_info::total_bytes_pinned inside the delayed ref itself
Currently we pass things around to figure out if we maybe freeing data based on the state of the delayed refs head. This makes the accounting sort of confusing and hard to follow, as it's distinctly separate from the delayed ref heads stuff, but also depends on it entirely. Fix this by explicitly adjusting the space_info->total_bytes_pinned in the delayed refs code. We now have two places where we modify this counter, once where we create the delayed and destroy the delayed refs, and once when we pin and unpin the extents. This means there is a slight overlap between delayed refs and the pin/unpin mechanisms, but this is simply used by the ENOSPC infrastructure to determine if we need to commit the transaction, so there's no adverse affect from this, we might simply commit thinking it will give us enough space when it might not. CC: stable@vger.kernel.org # 5.10 Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: David Sterba <dsterba@suse.com> |
|
|
|
e9aa7c285d |
btrfs: enable W=1 checks for btrfs
Now that the btrfs' codebase is clean of almost all W=1 warnings let's enable those checks unconditionally for the entire fs/btrfs/ and its subdirectories to catch potential errors during development. Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> [ add some comments ] Signed-off-by: David Sterba <dsterba@suse.com> |
|
|
|
8c31a3dbaa |
btrfs: zoned: remove unused variable in btrfs_sb_log_location_bdev
This fixes warning:
fs/btrfs/zoned.c:491:6: warning: variable ‘zone_size’ set but not used [-Wunused-but-set-variable]
491 | u64 zone_size;
which got introduced in
|
|
|
|
3bed2da1b0 |
btrfs: fix parameter description for functions in extent_io.c
This makes the file W=1 clean and fixes the following warnings: fs/btrfs/extent_io.c:414: warning: Function parameter or member 'tree' not described in '__etree_search' fs/btrfs/extent_io.c:414: warning: Function parameter or member 'offset' not described in '__etree_search' fs/btrfs/extent_io.c:414: warning: Function parameter or member 'next_ret' not described in '__etree_search' fs/btrfs/extent_io.c:414: warning: Function parameter or member 'prev_ret' not described in '__etree_search' fs/btrfs/extent_io.c:414: warning: Function parameter or member 'p_ret' not described in '__etree_search' fs/btrfs/extent_io.c:414: warning: Function parameter or member 'parent_ret' not described in '__etree_search' fs/btrfs/extent_io.c:1607: warning: Function parameter or member 'tree' not described in 'find_contiguous_extent_bit' fs/btrfs/extent_io.c:1607: warning: Function parameter or member 'start' not described in 'find_contiguous_extent_bit' fs/btrfs/extent_io.c:1607: warning: Function parameter or member 'start_ret' not described in 'find_contiguous_extent_bit' fs/btrfs/extent_io.c:1607: warning: Function parameter or member 'end_ret' not described in 'find_contiguous_extent_bit' fs/btrfs/extent_io.c:1607: warning: Function parameter or member 'bits' not described in 'find_contiguous_extent_bit' fs/btrfs/extent_io.c:1644: warning: Function parameter or member 'tree' not described in 'find_first_clear_extent_bit' fs/btrfs/extent_io.c:1644: warning: Function parameter or member 'start' not described in 'find_first_clear_extent_bit' fs/btrfs/extent_io.c:1644: warning: Function parameter or member 'start_ret' not described in 'find_first_clear_extent_bit' fs/btrfs/extent_io.c:1644: warning: Function parameter or member 'end_ret' not described in 'find_first_clear_extent_bit' fs/btrfs/extent_io.c:1644: warning: Function parameter or member 'bits' not described in 'find_first_clear_extent_bit' fs/btrfs/extent_io.c:4187: warning: Function parameter or member 'epd' not described in 'extent_write_cache_pages' fs/btrfs/extent_io.c:4187: warning: Excess function parameter 'data' description in 'extent_write_cache_pages' Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
|
|
|
d98b188ea4 |
btrfs: fix parameter description in space-info.c
With these fixes space-info.c is clear for W=1 warnings, namely the following ones are fixed: fs/btrfs/space-info.c:575: warning: Function parameter or member 'fs_info' not described in 'may_commit_transaction' fs/btrfs/space-info.c:575: warning: Function parameter or member 'space_info' not described in 'may_commit_transaction' fs/btrfs/space-info.c:1231: warning: Function parameter or member 'fs_info' not described in 'handle_reserve_ticket' fs/btrfs/space-info.c:1231: warning: Function parameter or member 'space_info' not described in 'handle_reserve_ticket' fs/btrfs/space-info.c:1231: warning: Function parameter or member 'ticket' not described in 'handle_reserve_ticket' fs/btrfs/space-info.c:1231: warning: Function parameter or member 'flush' not described in 'handle_reserve_ticket' fs/btrfs/space-info.c:1315: warning: Function parameter or member 'fs_info' not described in '__reserve_bytes' fs/btrfs/space-info.c:1315: warning: Function parameter or member 'space_info' not described in '__reserve_bytes' fs/btrfs/space-info.c:1315: warning: Function parameter or member 'orig_bytes' not described in '__reserve_bytes' fs/btrfs/space-info.c:1315: warning: Function parameter or member 'flush' not described in '__reserve_bytes' fs/btrfs/space-info.c:1427: warning: Function parameter or member 'root' not described in 'btrfs_reserve_metadata_bytes' fs/btrfs/space-info.c:1427: warning: Function parameter or member 'block_rsv' not described in 'btrfs_reserve_metadata_bytes' fs/btrfs/space-info.c:1427: warning: Function parameter or member 'orig_bytes' not described in 'btrfs_reserve_metadata_bytes' fs/btrfs/space-info.c:1427: warning: Function parameter or member 'flush' not described in 'btrfs_reserve_metadata_bytes' fs/btrfs/space-info.c:1462: warning: Function parameter or member 'fs_info' not described in 'btrfs_reserve_data_bytes' fs/btrfs/space-info.c:1462: warning: Function parameter or member 'bytes' not described in 'btrfs_reserve_data_bytes' fs/btrfs/space-info.c:1462: warning: Function parameter or member 'flush' not described in 'btrfs_reserve_data_bytes' Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
|
|
|
b762d1d08d |
btrfs: fix parameter description of btrfs_inode_rsv_release/btrfs_delalloc_release_space
Fixes following warnings: fs/btrfs/delalloc-space.c:205: warning: Function parameter or member 'inode' not described in 'btrfs_inode_rsv_release' fs/btrfs/delalloc-space.c:205: warning: Function parameter or member 'qgroup_free' not described in 'btrfs_inode_rsv_release' fs/btrfs/delalloc-space.c:472: warning: Function parameter or member 'reserved' not described in 'btrfs_delalloc_release_space' fs/btrfs/delalloc-space.c:472: warning: Function parameter or member 'qgroup_free' not described in 'btrfs_delalloc_release_space' fs/btrfs/delalloc-space.c:472: warning: Excess function parameter 'release_bytes' description in 'btrfs_delalloc_release_space' Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
|
|
|
6e353e3b3c |
btrfs: document btrfs_check_shared parameters
Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
|
|
|
2639631d34 |
btrfs: fix description format of fs_info of btrfs_wait_on_delayed_iputs
Fixes fs/btrfs/inode.c:3101: warning: Function parameter or member 'fs_info' not described in 'btrfs_wait_on_delayed_iputs' Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
|
|
|
9ee9b97990 |
btrfs: document fs_info in btrfs_rmap_block
Fixes fs/btrfs/block-group.c:1570: warning: Function parameter or member 'fs_info' not described in 'btrfs_rmap_block' Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
|
|
|
9241969547 |
btrfs: document now parameter of peek_discard_list
Fixes fs/btrfs/discard.c:203: warning: Function parameter or member 'now' not described in 'peek_discard_list' Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
|
|
|
f092cf3cfd |
btrfs: improve parameter description for __btrfs_write_out_cache
Fixes following W=1 warnings: fs/btrfs/free-space-cache.c:1317: warning: Function parameter or member 'root' not described in '__btrfs_write_out_cache' fs/btrfs/free-space-cache.c:1317: warning: Function parameter or member 'inode' not described in '__btrfs_write_out_cache' fs/btrfs/free-space-cache.c:1317: warning: Function parameter or member 'ctl' not described in '__btrfs_write_out_cache' fs/btrfs/free-space-cache.c:1317: warning: Function parameter or member 'block_group' not described in '__btrfs_write_out_cache' fs/btrfs/free-space-cache.c:1317: warning: Function parameter or member 'io_ctl' not described in '__btrfs_write_out_cache' fs/btrfs/free-space-cache.c:1317: warning: Function parameter or member 'trans' not described in '__btrfs_write_out_cache' Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
|
|
|
696eb22b67 |
btrfs: fix parameter description in delayed-ref.c functions
This fixes the following warnings: fs/btrfs/delayed-ref.c:80: warning: Function parameter or member 'fs_info' not described in 'btrfs_delayed_refs_rsv_release' fs/btrfs/delayed-ref.c:80: warning: Function parameter or member 'nr' not described in 'btrfs_delayed_refs_rsv_release' fs/btrfs/delayed-ref.c:128: warning: Function parameter or member 'fs_info' not described in 'btrfs_migrate_to_delayed_refs_rsv' fs/btrfs/delayed-ref.c:128: warning: Function parameter or member 'src' not described in 'btrfs_migrate_to_delayed_refs_rsv' fs/btrfs/delayed-ref.c:128: warning: Function parameter or member 'num_bytes' not described in 'btrfs_migrate_to_delayed_refs_rsv' fs/btrfs/delayed-ref.c:174: warning: Function parameter or member 'fs_info' not described in 'btrfs_delayed_refs_rsv_refill' fs/btrfs/delayed-ref.c:174: warning: Function parameter or member 'flush' not described in 'btrfs_delayed_refs_rsv_refill' Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
|
|
|
ca4207ae13 |
btrfs: fix function description formats in file-item.c
This fixes following W=1 warnings: fs/btrfs/file-item.c:27: warning: Cannot understand * @inode: the inode we want to update the disk_i_size for on line 27 - I thought it was a doc line fs/btrfs/file-item.c:65: warning: Cannot understand * @inode - the inode we're modifying on line 65 - I thought it was a doc line fs/btrfs/file-item.c:91: warning: Cannot understand * @inode - the inode we're modifying on line 91 - I thought it was a doc line Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
|
|
|
9ad37bb3ff |
btrfs: fix parameter description of btrfs_add_extent_mapping
This fixes the following compiler warnings: fs/btrfs/extent_map.c:601: warning: Function parameter or member 'fs_info' not described in 'btrfs_add_extent_mapping' fs/btrfs/extent_map.c:601: warning: Function parameter or member 'em_tree' not described in 'btrfs_add_extent_mapping' fs/btrfs/extent_map.c:601: warning: Function parameter or member 'em_in' not described in 'btrfs_add_extent_mapping' fs/btrfs/extent_map.c:601: warning: Function parameter or member 'start' not described in 'btrfs_add_extent_mapping' fs/btrfs/extent_map.c:601: warning: Function parameter or member 'len' not described in 'btrfs_add_extent_mapping' Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
|
|
|
401bd2dd12 |
btrfs: document modified parameter of add_extent_mapping
Fixes fs/btrfs/extent_map.c:399: warning: Function parameter or member 'modified' not described in 'add_extent_mapping' Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
|
|
|
3c198fe064 |
btrfs: rework the order of btrfs_ordered_extent::flags
[BUG]
There is a long existing bug in the last parameter of
btrfs_add_ordered_extent(), in commit
|
|
|
|
fe3b7bb085 |
btrfs: remove redundant NULL check before kvfree
Fix below warnings reported by coccicheck: ./fs/btrfs/raid56.c:237:2-8: WARNING: NULL check before some freeing functions is not needed. Reported-by: Abaci Robot <abaci@linux.alibaba.com> Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: Yang Li <abaci-bugfix@linux.alibaba.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
|
|
|
7e2a870a59 |
btrfs: do not cleanup upper nodes in btrfs_backref_cleanup_node
Zygo reported the following panic when testing my error handling patches for relocation: kernel BUG at fs/btrfs/backref.c:2545! invalid opcode: 0000 [#1] SMP KASAN PTI CPU: 3 PID: 8472 Comm: btrfs Tainted: G W 14 Hardware name: QEMU Standard PC (i440FX + PIIX, Call Trace: btrfs_backref_error_cleanup+0x4df/0x530 build_backref_tree+0x1a5/0x700 ? _raw_spin_unlock+0x22/0x30 ? release_extent_buffer+0x225/0x280 ? free_extent_buffer.part.52+0xd7/0x140 relocate_tree_blocks+0x2a6/0xb60 ? kasan_unpoison_shadow+0x35/0x50 ? do_relocation+0xc10/0xc10 ? kasan_kmalloc+0x9/0x10 ? kmem_cache_alloc_trace+0x6a3/0xcb0 ? free_extent_buffer.part.52+0xd7/0x140 ? rb_insert_color+0x342/0x360 ? add_tree_block.isra.36+0x236/0x2b0 relocate_block_group+0x2eb/0x780 ? merge_reloc_roots+0x470/0x470 btrfs_relocate_block_group+0x26e/0x4c0 btrfs_relocate_chunk+0x52/0x120 btrfs_balance+0xe2e/0x18f0 ? pvclock_clocksource_read+0xeb/0x190 ? btrfs_relocate_chunk+0x120/0x120 ? lock_contended+0x620/0x6e0 ? do_raw_spin_lock+0x1e0/0x1e0 ? do_raw_spin_unlock+0xa8/0x140 btrfs_ioctl_balance+0x1f9/0x460 btrfs_ioctl+0x24c8/0x4380 ? __kasan_check_read+0x11/0x20 ? check_chain_key+0x1f4/0x2f0 ? __asan_loadN+0xf/0x20 ? btrfs_ioctl_get_supported_features+0x30/0x30 ? kvm_sched_clock_read+0x18/0x30 ? check_chain_key+0x1f4/0x2f0 ? lock_downgrade+0x3f0/0x3f0 ? handle_mm_fault+0xad6/0x2150 ? do_vfs_ioctl+0xfc/0x9d0 ? ioctl_file_clone+0xe0/0xe0 ? check_flags.part.50+0x6c/0x1e0 ? check_flags.part.50+0x6c/0x1e0 ? check_flags+0x26/0x30 ? lock_is_held_type+0xc3/0xf0 ? syscall_enter_from_user_mode+0x1b/0x60 ? do_syscall_64+0x13/0x80 ? rcu_read_lock_sched_held+0xa1/0xd0 ? __kasan_check_read+0x11/0x20 ? __fget_light+0xae/0x110 __x64_sys_ioctl+0xc3/0x100 do_syscall_64+0x37/0x80 entry_SYSCALL_64_after_hwframe+0x44/0xa9 This occurs because of this check if (RB_EMPTY_NODE(&upper->rb_node)) BUG_ON(!list_empty(&node->upper)); As we are dropping the backref node, if we discover that our upper node in the edge we just cleaned up isn't linked into the cache that we are now done with this node, thus the BUG_ON(). However this is an erroneous assumption, as we will look up all the references for a node first, and then process the pending edges. All of the 'upper' nodes in our pending edges won't be in the cache's rb_tree yet, because they haven't been processed. We could very well have many edges still left to cleanup on this node. The fact is we simply do not need this check, we can just process all of the edges only for this node, because below this check we do the following if (list_empty(&upper->lower)) { list_add_tail(&upper->lower, &cache->leaves); upper->lowest = 1; } If the upper node truly isn't used yet, then we add it to the cache->leaves list to be cleaned up later. If it is still used then the last child node that has it linked into its node will add it to the leaves list and then it will be cleaned up. Fix this problem by dropping this logic altogether. With this fix I no longer see the panic when testing with error injection in the backref code. CC: stable@vger.kernel.org # 4.4+ Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: David Sterba <dsterba@suse.com> |
|
|
|
f7ba2d3751 |
btrfs: keep track of the root owner for relocation reads
While testing the error paths in relocation, I hit the following lockdep splat: ====================================================== WARNING: possible circular locking dependency detected 5.10.0-rc3+ #206 Not tainted ------------------------------------------------------ btrfs-balance/1571 is trying to acquire lock: ffff8cdbcc8f77d0 (&head_ref->mutex){+.+.}-{3:3}, at: btrfs_lookup_extent_info+0x156/0x3b0 but task is already holding lock: ffff8cdbc54adbf8 (btrfs-tree-00){++++}-{3:3}, at: __btrfs_tree_lock+0x27/0x100 which lock already depends on the new lock. the existing dependency chain (in reverse order) is: -> #2 (btrfs-tree-00){++++}-{3:3}: down_write_nested+0x43/0x80 __btrfs_tree_lock+0x27/0x100 btrfs_search_slot+0x248/0x890 relocate_tree_blocks+0x490/0x650 relocate_block_group+0x1ba/0x5d0 kretprobe_trampoline+0x0/0x50 -> #1 (btrfs-csum-01){++++}-{3:3}: down_read_nested+0x43/0x130 __btrfs_tree_read_lock+0x27/0x100 btrfs_read_lock_root_node+0x31/0x40 btrfs_search_slot+0x5ab/0x890 btrfs_del_csums+0x10b/0x3c0 __btrfs_free_extent+0x49d/0x8e0 __btrfs_run_delayed_refs+0x283/0x11f0 btrfs_run_delayed_refs+0x86/0x220 btrfs_start_dirty_block_groups+0x2ba/0x520 kretprobe_trampoline+0x0/0x50 -> #0 (&head_ref->mutex){+.+.}-{3:3}: __lock_acquire+0x1167/0x2150 lock_acquire+0x116/0x3e0 __mutex_lock+0x7e/0x7b0 btrfs_lookup_extent_info+0x156/0x3b0 walk_down_proc+0x1c3/0x280 walk_down_tree+0x64/0xe0 btrfs_drop_subtree+0x182/0x260 do_relocation+0x52e/0x660 relocate_tree_blocks+0x2ae/0x650 relocate_block_group+0x1ba/0x5d0 kretprobe_trampoline+0x0/0x50 other info that might help us debug this: Chain exists of: &head_ref->mutex --> btrfs-csum-01 --> btrfs-tree-00 Possible unsafe locking scenario: CPU0 CPU1 ---- ---- lock(btrfs-tree-00); lock(btrfs-csum-01); lock(btrfs-tree-00); lock(&head_ref->mutex); *** DEADLOCK *** 5 locks held by btrfs-balance/1571: #0: ffff8cdb89749ff8 (&fs_info->delete_unused_bgs_mutex){+.+.}-{3:3}, at: btrfs_balance+0x563/0xf40 #1: ffff8cdb89748838 (&fs_info->cleaner_mutex){+.+.}-{3:3}, at: btrfs_relocate_block_group+0x156/0x300 #2: ffff8cdbc2c16650 (sb_internal#2){.+.+}-{0:0}, at: start_transaction+0x413/0x5c0 #3: ffff8cdbc135f538 (btrfs-treloc-01){+.+.}-{3:3}, at: __btrfs_tree_lock+0x27/0x100 #4: ffff8cdbc54adbf8 (btrfs-tree-00){++++}-{3:3}, at: __btrfs_tree_lock+0x27/0x100 stack backtrace: CPU: 1 PID: 1571 Comm: btrfs-balance Not tainted 5.10.0-rc3+ #206 Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.13.0-2.fc32 04/01/2014 Call Trace: dump_stack+0x8b/0xb0 check_noncircular+0xcf/0xf0 ? trace_call_bpf+0x139/0x260 __lock_acquire+0x1167/0x2150 lock_acquire+0x116/0x3e0 ? btrfs_lookup_extent_info+0x156/0x3b0 __mutex_lock+0x7e/0x7b0 ? btrfs_lookup_extent_info+0x156/0x3b0 ? btrfs_lookup_extent_info+0x156/0x3b0 ? release_extent_buffer+0x124/0x170 ? _raw_spin_unlock+0x1f/0x30 ? release_extent_buffer+0x124/0x170 btrfs_lookup_extent_info+0x156/0x3b0 walk_down_proc+0x1c3/0x280 walk_down_tree+0x64/0xe0 btrfs_drop_subtree+0x182/0x260 do_relocation+0x52e/0x660 relocate_tree_blocks+0x2ae/0x650 ? add_tree_block+0x149/0x1b0 relocate_block_group+0x1ba/0x5d0 elfcorehdr_read+0x40/0x40 ? elfcorehdr_read+0x40/0x40 ? btrfs_balance+0x796/0xf40 ? __kthread_parkme+0x66/0x90 ? btrfs_balance+0xf40/0xf40 ? balance_kthread+0x37/0x50 ? kthread+0x137/0x150 ? __kthread_bind_mask+0x60/0x60 ? ret_from_fork+0x1f/0x30 As you can see this is bogus, we never take another tree's lock under the csum lock. This happens because sometimes we have to read tree blocks from disk without knowing which root they belong to during relocation. We defaulted to an owner of 0, which translates to an fs tree. This is fine as all fs trees have the same class, but obviously isn't fine if the block belongs to a COW only tree. Thankfully COW only trees only have their owners root as a reference to them, and since we already look up the extent information during relocation, go ahead and check and see if this block might belong to a COW only tree, and if so save the owner in the tree_block struct. This allows us to read_tree_block with the proper owner, which gets rid of this lockdep splat. Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
|
|
|
c0f0a9e716 |
btrfs: introduce helper to grab an existing extent buffer from a page
This patch will extract the code to grab an extent buffer from a page into a helper, grab_extent_buffer_from_page(). This reduces one indent level, and provides the work place for later expansion for subapge support. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
|
|
|
c0fab48095 |
btrfs: update comment for btrfs_dirty_pages
The original comment is from the initial merge, which has several problems: - No holes check any more - No inline decision is made Update the out-of-date comment with more correct one. Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
|
|
|
6bc5636a67 |
btrfs: refactor __extent_writepage_io() to improve readability
The refactoring involves the following modifications: - iosize alignment In fact we don't really need to manually do alignment at all. All extent maps should already be aligned, thus basic ASSERT() check would be enough. - redundant variables We have extra variable like blocksize/pg_offset/end. They are all unnecessary. @blocksize can be replaced by sectorsize size directly, and it's only used to verify the em start/size is aligned. @pg_offset can be easily calculated using @cur and page_offset(page). @end is just assigned from @page_end and never modified, use "start + PAGE_SIZE - 1" directly and remove @page_end. - remove some BUG_ON()s The BUG_ON()s are for extent map, which we have tree-checker to check on-disk extent data item and runtime check. ASSERT() should be enough. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
|
|
|
0c64c33c60 |
btrfs: rename parameter offset to disk_bytenr in submit_extent_page
The parameter offset is confusing, it's supposed to be the disk bytenr of metadata/data. Rename it to disk_bytenr and update the comment. Also rename each offset passed to submit_extent_page() as @disk_bytenr so they're consistent. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
|
|
|
58f74b2203 |
btrfs: refactor btrfs_dec_test_* functions for ordered extents
The refactoring involves the following modifications:
- Return bool instead of int
- Parameter update for @cached of btrfs_dec_test_first_ordered_pending()
For btrfs_dec_test_first_ordered_pending(), @cached is only used to
return the finished ordered extent.
Rename it to @finished_ret.
- Comment updates
* Change one stale comment
Which still refers to btrfs_dec_test_ordered_pending(), but the
context is calling btrfs_dec_test_first_ordered_pending().
* Follow the common comment style for both functions
Add more detailed descriptions for parameters and the return value
* Move the reason why test_and_set_bit() is used into the call sites
- Change how the return value is calculated
The most anti-human part of the return value is:
if (...)
ret = 1;
...
return ret == 0;
This means, when we set ret to 1, the function returns 0.
Change the local variable name to @finished, and directly return the
value of it.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
|
|
523929f1ca |
btrfs: make btrfs_dio_private::bytes u32
btrfs_dio_private::bytes is only assigned from bio::bi_iter::bi_size, which is never larger than U32. Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
|
|
|
d7830b7155 |
btrfs: remove always true condition in btrfs_start_delalloc_roots
Following the rework in
|
|
|
|
9db4dc241e |
btrfs: make btrfs_start_delalloc_root's nr argument a long
It's currently u64 which gets instantly translated either to LONG_MAX (if U64_MAX is passed) or cast to an unsigned long (which is in fact, wrong because writeback_control::nr_to_write is a signed, long type). Just convert the function's argument to be long time which obviates the need to manually convert u64 value to a long. Adjust all call sites which pass U64_MAX to pass LONG_MAX. Finally ensure that in shrink_delalloc the u64 is converted to a long without overflowing, resulting in a negative number. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
|
|
|
9c4a062a94 |
btrfs: send: remove stale code when checking for shared extents
After commit
|
|
|
|
7056bf69e5 |
btrfs: consolidate btrfs_previous_item ret val handling in btrfs_shrink_device
Instead of having three 'if' to handle non-NULL return value consolidate this in one 'if (ret)'. That way the code is more obvious: - Always drop delete_unused_bgs_mutex if ret is not NULL - If ret is negative -> goto done - If it's 1 -> reset ret to 0, release the path and finish the loop. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
|
|
|
1478143ac8 |
btrfs: ref-verify: make sure owner is set for all refs
I noticed that shared ref entries in ref-verify didn't have the proper owner set, which caused me to think there was something seriously wrong. However the problem is if we have a parent we simply weren't filling out the owner part of the reference, even though we have it. Fix this by making sure we set all the proper fields when we modify a reference, this way we'll have the proper owner if a problem happens and we don't waste time thinking we're updating the wrong level. Signed-off-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: David Sterba <dsterba@suse.com> |
|
|
|
0d73a11c62 |
btrfs: ref-verify: pass down tree block level when building refs
I noticed that sometimes I would have the wrong level printed out with ref-verify while testing some error injection related problems. This is because we only get the level from the main extent item, but our references could go off the current leaf into another, and at that point we lose our level. Fix this by keeping track of the last tree block level that we found, the same way we keep track of our bytenr and num_bytes, in case we happen to wander into another leaf while still processing the references for a bytenr. Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
|
|
|
1fec12a560 |
btrfs: noinline btrfs_should_cancel_balance
I was attempting to reproduce a problem that Zygo hit, but my error injection wasn't firing for a few of the common calls to btrfs_should_cancel_balance. This is because the compiler decided to inline it at these spots. Keep this from happening by explicitly marking the function as noinline so that error injection will always work. Reviewed-by: Qu Wenruo <wqu@suse.com> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
|
|
|
f75e2b79b5 |
btrfs: allow error injection for btrfs_search_slot and btrfs_cow_block
The following patches are going to address error handling in relocation, in order to test those patches I need to be able to inject errors in btrfs_search_slot and btrfs_cow_block, as we call both of these pretty often in different cases during relocation. Reviewed-by: Qu Wenruo <wqu@suse.com> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
|
|
|
69948022c9 |
btrfs: remove new_dirid argument from btrfs_create_subvol_root
It's no longer used. While at it also remove new_dirid in create_subvol as it's used in a single place and open code it. No functional changes. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
|
|
|
23125104d8 |
btrfs: make btrfs_root::free_objectid hold the next available objectid
Adjust the way free_objectid is being initialized, it now stores BTRFS_FIRST_FREE_OBJECTID rather than the, somewhat arbitrary, BTRFS_FIRST_FREE_OBJECTID - 1. This change also has the added benefit that now it becomes unnecessary to explicitly initialize free_objectid for a newly create fs root. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
|
|
|
6b8fad576a |
btrfs: rename btrfs_root::highest_objectid to free_objectid
This reflects the true purpose of the member as it's being used solely in context where a new objectid is being allocated. Future changes will also change the way it's being used to closely follow this semantics. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
|
|
|
543068a217 |
btrfs: rename btrfs_find_free_objectid to btrfs_get_free_objectid
This better reflects the semantics of the function i.e no search is performed whatsoever. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
|
|
|
453e487386 |
btrfs: rename btrfs_find_highest_objectid to btrfs_init_root_free_objectid
This function is used to initialize the in-memory btrfs_root::highest_objectid member, which is used to get an available objectid. Rename it to better reflect its semantics. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
|
|
|
149716570b |
btrfs: cleanup local variables in btrfs_file_write_iter
First replace all inode instances with a pointer to btrfs_inode. This removes multiple invocations of the BTRFS_I macro, subsequently remove 2 local variables as they are called only once and simply refer to them directly. Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
|
|
|
3cc64e7ebf |
btrfs: clarify error returns values in __load_free_space_cache
Return value in __load_free_space_cache is not properly set after
(unlikely) memory allocation failures and 0 is returned instead.
This is not a problem for the caller load_free_space_cache because only
value 1 is considered as 'cache loaded' but for clarity it's better
to set the errors accordingly.
Fixes:
|
|
|
|
4f4317c13a |
btrfs: fix error handling in commit_fs_roots
While doing error injection I would sometimes get a corrupt file system. This is because I was injecting errors at btrfs_search_slot, but would only do it one time per stack. This uncovered a problem in commit_fs_roots, where if we get an error we would just break. However we're in a nested loop, the first loop being a loop to find all the dirty fs roots, and then subsequent root updates would succeed clearing the error value. This isn't likely to happen in real scenarios, however we could potentially get a random ENOMEM once and then not again, and we'd end up with a corrupted file system. Fix this by moving the error checking around a bit to the main loop, as this is the only place where something will fail, and return the error as soon as it occurs. With this patch my reproducer no longer corrupts the file system. Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
|
|
|
c05d51c773 |
for-5.11-rc5-tag
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEE8rQSAMVO+zA4DBdWxWXV+ddtWDsFAmAUIkAACgkQxWXV+ddt
WDsWVg/+IIEk9H1v9q9ShvVmPvmnlT8/0ywj1hdwFMBkFBjIeU8tBz9ZMGPXCzrF
XemmWKChVOnR3SIq/bMrwuRC/Gv/pBvwVshXLP51YJHv7lSGX0Ayrb27BFQcVaC/
3QhpE7veEiqxwLyMj+LWG4hE2X+oqiqzrXCpeC5un4zEluT45RSKooqueQ4jM8aw
DrKLQA57a1YEIqrE2KQzy5A6BnSNyxPXEEX34kbugmmen46Fh77hrwme1K9vQn1t
v3/V4LcarXADxxokAxU2Igb/vK0+BN33NOYsBwLWWD4kUaTGS4KczsDOowkRRTMH
/qiQUdca0X7ElR+VFl8rgB8PxuJcZ87aCdsMkErUA4sjxyp11VDIeEgirPNAcXtR
b+1LIkn3k3l8JzkKyXwDuZuNBsh0idTY24IE+QDBMIGq+jE1N6N3t5gEwa2NeaiP
9O5QnS5XAJCo8a9+gp1aF5z94vwQwvf9TA80nGrnpxGmXEEEZ9PgXsc4JON1Blhn
NtJDwBPzEjHCEYdE73/lRMsLmYeGhpRugKb+lQ+OTo2iZzxH2SjWn9vXKiN7vAp2
zysjzdPfkY5BLggH5cPg0fuRaf/Is00EeVqn3eA7QsFKDhrpoPFBO+aV5xeshsaz
8fjt7kkXFb+Vyy4SDvmPioJQ7/MFZ5Czn+BL1JwO4l/vYcEMUzM=
=/yHv
-----END PGP SIGNATURE-----
Merge tag 'for-5.11-rc5-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux
Pull btrfs fixes from David Sterba:
"A few more fixes for a late rc:
- fix lockdep complaint on 32bit arches and also remove an unsafe
memory use due to device vs filesystem lifetime
- two fixes for free space tree:
* race during log replay and cache rebuild, now more likely to
happen due to changes in this dev cycle
* possible free space tree corruption with online conversion
during initial tree population"
* tag 'for-5.11-rc5-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
btrfs: fix log replay failure due to race with space cache rebuild
btrfs: fix lockdep warning due to seqcount_mutex on 32bit arch
btrfs: fix possible free space tree corruption with online conversion
|
|
|
|
616c6a6884 |
btrfs: use bio_kmalloc in __alloc_device
Use bio_kmalloc instead of open coding it. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com> Acked-by: Damien Le Moal <damien.lemoal@wdc.com> Signed-off-by: Jens Axboe <axboe@kernel.dk> |
|
|
|
9ad6d91f05 |
btrfs: fix log replay failure due to race with space cache rebuild
After a sudden power failure we may end up with a space cache on disk that
is not valid and needs to be rebuilt from scratch.
If that happens, during log replay when we attempt to pin an extent buffer
from a log tree, at btrfs_pin_extent_for_log_replay(), we do not wait for
the space cache to be rebuilt through the call to:
btrfs_cache_block_group(cache, 1);
That is because that only waits for the task (work queue job) that loads
the space cache to change the cache state from BTRFS_CACHE_FAST to any
other value. That is ok when the space cache on disk exists and is valid,
but when the cache is not valid and needs to be rebuilt, it ends up
returning as soon as the cache state changes to BTRFS_CACHE_STARTED (done
at caching_thread()).
So this means that we can end up trying to unpin a range which is not yet
marked as free in the block group. This results in the call to
btrfs_remove_free_space() to return -EINVAL to
btrfs_pin_extent_for_log_replay(), which in turn makes the log replay fail
as well as mounting the filesystem. More specifically the -EINVAL comes
from free_space_cache.c:remove_from_bitmap(), because the requested range
is not marked as free space (ones in the bitmap), we have the following
condition triggered:
static noinline int remove_from_bitmap(struct btrfs_free_space_ctl *ctl,
(...)
if (ret < 0 || search_start != *offset)
return -EINVAL;
(...)
It's the "search_start != *offset" that results in the condition being
evaluated to true.
When this happens we got the following in dmesg/syslog:
[72383.415114] BTRFS: device fsid 32b95b69-0ea9-496a-9f02-3f5a56dc9322 devid 1 transid 1432 /dev/sdb scanned by mount (3816007)
[72383.417837] BTRFS info (device sdb): disk space caching is enabled
[72383.418536] BTRFS info (device sdb): has skinny extents
[72383.423846] BTRFS info (device sdb): start tree-log replay
[72383.426416] BTRFS warning (device sdb): block group 30408704 has wrong amount of free space
[72383.427686] BTRFS warning (device sdb): failed to load free space cache for block group 30408704, rebuilding it now
[72383.454291] BTRFS: error (device sdb) in btrfs_recover_log_trees:6203: errno=-22 unknown (Failed to pin buffers while recovering log root tree.)
[72383.456725] BTRFS: error (device sdb) in btrfs_replay_log:2253: errno=-22 unknown (Failed to recover log tree)
[72383.460241] BTRFS error (device sdb): open_ctree failed
We also mark the range for the extent buffer in the excluded extents io
tree. That is fine when the space cache is valid on disk and we can load
it, in which case it causes no problems.
However, for the case where we need to rebuild the space cache, because it
is either invalid or it is missing, having the extent buffer range marked
in the excluded extents io tree leads to a -EINVAL failure from the call
to btrfs_remove_free_space(), resulting in the log replay and mount to
fail. This is because by having the range marked in the excluded extents
io tree, the caching thread ends up never adding the range of the extent
buffer as free space in the block group since the calls to
add_new_free_space(), called from load_extent_tree_free(), filter out any
ranges that are marked as excluded extents.
So fix this by making sure that during log replay we wait for the caching
task to finish completely when we need to rebuild a space cache, and also
drop the need to mark the extent buffer range in the excluded extents io
tree, as well as clearing ranges from that tree at
btrfs_finish_extent_commit().
This started to happen with some frequency on large filesystems having
block groups with a lot of fragmentation since the recent commit
|
|
|
|
c41ec4529d |
btrfs: fix lockdep warning due to seqcount_mutex on 32bit arch
This effectively reverts commit |
|
|
|
2f96e40212 |
btrfs: fix possible free space tree corruption with online conversion
While running btrfs/011 in a loop I would often ASSERT() while trying to
add a new free space entry that already existed, or get an EEXIST while
adding a new block to the extent tree, which is another indication of
double allocation.
This occurs because when we do the free space tree population, we create
the new root and then populate the tree and commit the transaction.
The problem is when you create a new root, the root node and commit root
node are the same. During this initial transaction commit we will run
all of the delayed refs that were paused during the free space tree
generation, and thus begin to cache block groups. While caching block
groups the caching thread will be reading from the main root for the
free space tree, so as we make allocations we'll be changing the free
space tree, which can cause us to add the same range twice which results
in either the ASSERT(ret != -EEXIST); in __btrfs_add_free_space, or in a
variety of different errors when running delayed refs because of a
double allocation.
Fix this by marking the fs_info as unsafe to load the free space tree,
and fall back on the old slow method. We could be smarter than this,
for example caching the block group while we're populating the free
space tree, but since this is a serious problem I've opted for the
simplest solution.
CC: stable@vger.kernel.org # 4.9+
Fixes:
|
|
|
|
309dca309f |
block: store a block_device pointer in struct bio
Replace the gendisk pointer in struct bio with a pointer to the newly improved struct block device. From that the gendisk can be trivially accessed with an extra indirection, but it also allows to directly look up all information related to partition remapping. Signed-off-by: Christoph Hellwig <hch@lst.de> Acked-by: Tejun Heo <tj@kernel.org> Signed-off-by: Jens Axboe <axboe@kernel.dk> |
|
|
|
549c729771
|
fs: make helpers idmap mount aware
Extend some inode methods with an additional user namespace argument. A filesystem that is aware of idmapped mounts will receive the user namespace the mount has been marked with. This can be used for additional permission checking and also to enable filesystems to translate between uids and gids if they need to. We have implemented all relevant helpers in earlier patches. As requested we simply extend the exisiting inode method instead of introducing new ones. This is a little more code churn but it's mostly mechanical and doesnt't leave us with additional inode methods. Link: https://lore.kernel.org/r/20210121131959.646623-25-christian.brauner@ubuntu.com Cc: Christoph Hellwig <hch@lst.de> Cc: David Howells <dhowells@redhat.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: linux-fsdevel@vger.kernel.org Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com> |
|
|
|
ba73d98745
|
namei: handle idmapped mounts in may_*() helpers
The may_follow_link(), may_linkat(), may_lookup(), may_open(), may_o_create(), may_create_in_sticky(), may_delete(), and may_create() helpers determine whether the caller is privileged enough to perform the associated operations. Let them handle idmapped mounts by mapping the inode or fsids according to the mount's user namespace. Afterwards the checks are identical to non-idmapped inodes. The patch takes care to retrieve the mount's user namespace right before performing permission checks and passing it down into the fileystem so the user namespace can't change in between by someone idmapping a mount that is currently not idmapped. If the initial user namespace is passed nothing changes so non-idmapped mounts will see identical behavior as before. Link: https://lore.kernel.org/r/20210121131959.646623-13-christian.brauner@ubuntu.com Cc: Christoph Hellwig <hch@lst.de> Cc: David Howells <dhowells@redhat.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: linux-fsdevel@vger.kernel.org Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: James Morris <jamorris@linux.microsoft.com> Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com> |
|
|
|
0d56a4518d
|
stat: handle idmapped mounts
The generic_fillattr() helper fills in the basic attributes associated with an inode. Enable it to handle idmapped mounts. If the inode is accessed through an idmapped mount map it into the mount's user namespace before we store the uid and gid. If the initial user namespace is passed nothing changes so non-idmapped mounts will see identical behavior as before. Link: https://lore.kernel.org/r/20210121131959.646623-12-christian.brauner@ubuntu.com Cc: Christoph Hellwig <hch@lst.de> Cc: David Howells <dhowells@redhat.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: linux-fsdevel@vger.kernel.org Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: James Morris <jamorris@linux.microsoft.com> Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com> |
|
|
|
e65ce2a50c
|
acl: handle idmapped mounts
The posix acl permission checking helpers determine whether a caller is privileged over an inode according to the acls associated with the inode. Add helpers that make it possible to handle acls on idmapped mounts. The vfs and the filesystems targeted by this first iteration make use of posix_acl_fix_xattr_from_user() and posix_acl_fix_xattr_to_user() to translate basic posix access and default permissions such as the ACL_USER and ACL_GROUP type according to the initial user namespace (or the superblock's user namespace) to and from the caller's current user namespace. Adapt these two helpers to handle idmapped mounts whereby we either map from or into the mount's user namespace depending on in which direction we're translating. Similarly, cap_convert_nscap() is used by the vfs to translate user namespace and non-user namespace aware filesystem capabilities from the superblock's user namespace to the caller's user namespace. Enable it to handle idmapped mounts by accounting for the mount's user namespace. In addition the fileystems targeted in the first iteration of this patch series make use of the posix_acl_chmod() and, posix_acl_update_mode() helpers. Both helpers perform permission checks on the target inode. Let them handle idmapped mounts. These two helpers are called when posix acls are set by the respective filesystems to handle this case we extend the ->set() method to take an additional user namespace argument to pass the mount's user namespace down. Link: https://lore.kernel.org/r/20210121131959.646623-9-christian.brauner@ubuntu.com Cc: Christoph Hellwig <hch@lst.de> Cc: David Howells <dhowells@redhat.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: linux-fsdevel@vger.kernel.org Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com> |
|
|
|
2f221d6f7b
|
attr: handle idmapped mounts
When file attributes are changed most filesystems rely on the setattr_prepare(), setattr_copy(), and notify_change() helpers for initialization and permission checking. Let them handle idmapped mounts. If the inode is accessed through an idmapped mount map it into the mount's user namespace. Afterwards the checks are identical to non-idmapped mounts. If the initial user namespace is passed nothing changes so non-idmapped mounts will see identical behavior as before. Helpers that perform checks on the ia_uid and ia_gid fields in struct iattr assume that ia_uid and ia_gid are intended values and have already been mapped correctly at the userspace-kernelspace boundary as we already do today. If the initial user namespace is passed nothing changes so non-idmapped mounts will see identical behavior as before. Link: https://lore.kernel.org/r/20210121131959.646623-8-christian.brauner@ubuntu.com Cc: Christoph Hellwig <hch@lst.de> Cc: David Howells <dhowells@redhat.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: linux-fsdevel@vger.kernel.org Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com> |
|
|
|
21cb47be6f
|
inode: make init and permission helpers idmapped mount aware
The inode_owner_or_capable() helper determines whether the caller is the owner of the inode or is capable with respect to that inode. Allow it to handle idmapped mounts. If the inode is accessed through an idmapped mount it according to the mount's user namespace. Afterwards the checks are identical to non-idmapped mounts. If the initial user namespace is passed nothing changes so non-idmapped mounts will see identical behavior as before. Similarly, allow the inode_init_owner() helper to handle idmapped mounts. It initializes a new inode on idmapped mounts by mapping the fsuid and fsgid of the caller from the mount's user namespace. If the initial user namespace is passed nothing changes so non-idmapped mounts will see identical behavior as before. Link: https://lore.kernel.org/r/20210121131959.646623-7-christian.brauner@ubuntu.com Cc: Christoph Hellwig <hch@lst.de> Cc: David Howells <dhowells@redhat.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: linux-fsdevel@vger.kernel.org Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: James Morris <jamorris@linux.microsoft.com> Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com> |
|
|
|
47291baa8d
|
namei: make permission helpers idmapped mount aware
The two helpers inode_permission() and generic_permission() are used by the vfs to perform basic permission checking by verifying that the caller is privileged over an inode. In order to handle idmapped mounts we extend the two helpers with an additional user namespace argument. On idmapped mounts the two helpers will make sure to map the inode according to the mount's user namespace and then peform identical permission checks to inode_permission() and generic_permission(). If the initial user namespace is passed nothing changes so non-idmapped mounts will see identical behavior as before. Link: https://lore.kernel.org/r/20210121131959.646623-6-christian.brauner@ubuntu.com Cc: Christoph Hellwig <hch@lst.de> Cc: David Howells <dhowells@redhat.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: linux-fsdevel@vger.kernel.org Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: James Morris <jamorris@linux.microsoft.com> Acked-by: Serge Hallyn <serge@hallyn.com> Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com> |
|
|
|
2f63296578 |
iomap: pass a flags argument to iomap_dio_rw
Pass a set of flags to iomap_dio_rw instead of the boolean wait_for_completion argument. The IOMAP_DIO_FORCE_WAIT flag replaces the wait_for_completion, but only needs to be passed when the iocb isn't synchronous to start with to simplify the callers. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Brian Foster <bfoster@redhat.com> [djwong: rework xfs_file.c so that we can push iomap changes separately] Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Darrick J. Wong <djwong@kernel.org> |
|
|
|
9791581c04 |
for-5.11-rc4-tag
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEE8rQSAMVO+zA4DBdWxWXV+ddtWDsFAmAIojwACgkQxWXV+ddt
WDstnw/+O0KSsK6ChZCNdjqFAgWL41RYj0fPOgM/8xlNaQyYHS0Jczeoud6m/2Wm
U41kTb/a6xpmx0Z/2uf/5pDIBPFld/IUuUf/AdJsMzy8Bpky2/sfg6Kmx0tKGLXQ
1WKp9ox0MlAUI0Tz/jGfX5rwsIgWKYKIF2iGUio/H1ktR3l+cXlmLWsSIB43F6VL
AjKRRyFCNU//dV7syNMmmj9yU0HpSs53SpWxUIURuTFaE71LyUgzaxDTlZ6c/PET
e4wdf8nl0wzEESCgSUPdh2AWNNiTEbbGhhhNi9250PUyQki2f4AGBlxVSLZH/fDn
6PbBDvefW4umCMeMxxmgnYJU6tG78qg/LvxzZXt54rOtB0WMbrIl0u7hFCVhQ3Qk
nqrS4tqeaL+OeuR6xamBMaRohgRFa9S+QVjTwtDFo/oVYH4TVvQDfKQS6GsWwDvB
ySzz3WewoFqhe47cMsy28Dg49xkDSIJIr5hmSNGSXTreZ2JIa+qLKywoH87+YDIE
ql0PN47z4NB+MbWDV7SZM8DCVqiQ7+1LOV9bPmqfvNl3YTfvXyMaoPLmWWVstPr2
iyhXrvESgm1s2RCF1a0tXIkv82L6QYjJ3eeEDsvAmtKBouNL9BnMvwi3zW5yKiry
m1qj7C7e6C1TivYitcCfbRCKqeAnUv8VwkSbW9BvNDe7i5AD++U=
=gSYr
-----END PGP SIGNATURE-----
Merge tag 'for-5.11-rc4-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux
Pull btrfs fixes from David Sterba:
"A few more one line fixes for various bugs, stable material.
- fix send when emitting clone operation from the same file and root
- fix double free on error when cleaning backrefs
- lockdep fix during relocation
- handle potential error during reloc when starting transaction
- skip running delayed refs during commit (leftover from code removal
in this dev cycle)"
* tag 'for-5.11-rc4-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
btrfs: don't clear ret in btrfs_start_dirty_block_groups
btrfs: fix lockdep splat in btrfs_recover_relocation
btrfs: do not double free backref nodes on error
btrfs: don't get an EINTR during drop_snapshot for reloc
btrfs: send: fix invalid clone operations when cloning from the same file and root
btrfs: no need to run delayed refs after commit_fs_roots during commit
|
|
|
|
34d1eb0e59 |
btrfs: don't clear ret in btrfs_start_dirty_block_groups
If we fail to update a block group item in the loop we'll break, however we'll do btrfs_run_delayed_refs and lose our error value in ret, and thus not clean up properly. Fix this by only running the delayed refs if there was no failure. CC: stable@vger.kernel.org # 4.4+ Reviewed-by: Qu Wenruo <wqu@suse.com> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
|
|
|
fb28610097 |
btrfs: fix lockdep splat in btrfs_recover_relocation
While testing the error paths of relocation I hit the following lockdep splat: ====================================================== WARNING: possible circular locking dependency detected 5.10.0-rc6+ #217 Not tainted ------------------------------------------------------ mount/779 is trying to acquire lock: ffffa0e676945418 (&fs_info->balance_mutex){+.+.}-{3:3}, at: btrfs_recover_balance+0x2f0/0x340 but task is already holding lock: ffffa0e60ee31da8 (btrfs-root-00){++++}-{3:3}, at: __btrfs_tree_read_lock+0x27/0x100 which lock already depends on the new lock. the existing dependency chain (in reverse order) is: -> #2 (btrfs-root-00){++++}-{3:3}: down_read_nested+0x43/0x130 __btrfs_tree_read_lock+0x27/0x100 btrfs_read_lock_root_node+0x31/0x40 btrfs_search_slot+0x462/0x8f0 btrfs_update_root+0x55/0x2b0 btrfs_drop_snapshot+0x398/0x750 clean_dirty_subvols+0xdf/0x120 btrfs_recover_relocation+0x534/0x5a0 btrfs_start_pre_rw_mount+0xcb/0x170 open_ctree+0x151f/0x1726 btrfs_mount_root.cold+0x12/0xea legacy_get_tree+0x30/0x50 vfs_get_tree+0x28/0xc0 vfs_kern_mount.part.0+0x71/0xb0 btrfs_mount+0x10d/0x380 legacy_get_tree+0x30/0x50 vfs_get_tree+0x28/0xc0 path_mount+0x433/0xc10 __x64_sys_mount+0xe3/0x120 do_syscall_64+0x33/0x40 entry_SYSCALL_64_after_hwframe+0x44/0xa9 -> #1 (sb_internal#2){.+.+}-{0:0}: start_transaction+0x444/0x700 insert_balance_item.isra.0+0x37/0x320 btrfs_balance+0x354/0xf40 btrfs_ioctl_balance+0x2cf/0x380 __x64_sys_ioctl+0x83/0xb0 do_syscall_64+0x33/0x40 entry_SYSCALL_64_after_hwframe+0x44/0xa9 -> #0 (&fs_info->balance_mutex){+.+.}-{3:3}: __lock_acquire+0x1120/0x1e10 lock_acquire+0x116/0x370 __mutex_lock+0x7e/0x7b0 btrfs_recover_balance+0x2f0/0x340 open_ctree+0x1095/0x1726 btrfs_mount_root.cold+0x12/0xea legacy_get_tree+0x30/0x50 vfs_get_tree+0x28/0xc0 vfs_kern_mount.part.0+0x71/0xb0 btrfs_mount+0x10d/0x380 legacy_get_tree+0x30/0x50 vfs_get_tree+0x28/0xc0 path_mount+0x433/0xc10 __x64_sys_mount+0xe3/0x120 do_syscall_64+0x33/0x40 entry_SYSCALL_64_after_hwframe+0x44/0xa9 other info that might help us debug this: Chain exists of: &fs_info->balance_mutex --> sb_internal#2 --> btrfs-root-00 Possible unsafe locking scenario: CPU0 CPU1 ---- ---- lock(btrfs-root-00); lock(sb_internal#2); lock(btrfs-root-00); lock(&fs_info->balance_mutex); *** DEADLOCK *** 2 locks held by mount/779: #0: ffffa0e60dc040e0 (&type->s_umount_key#47/1){+.+.}-{3:3}, at: alloc_super+0xb5/0x380 #1: ffffa0e60ee31da8 (btrfs-root-00){++++}-{3:3}, at: __btrfs_tree_read_lock+0x27/0x100 stack backtrace: CPU: 0 PID: 779 Comm: mount Not tainted 5.10.0-rc6+ #217 Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.13.0-2.fc32 04/01/2014 Call Trace: dump_stack+0x8b/0xb0 check_noncircular+0xcf/0xf0 ? trace_call_bpf+0x139/0x260 __lock_acquire+0x1120/0x1e10 lock_acquire+0x116/0x370 ? btrfs_recover_balance+0x2f0/0x340 __mutex_lock+0x7e/0x7b0 ? btrfs_recover_balance+0x2f0/0x340 ? btrfs_recover_balance+0x2f0/0x340 ? rcu_read_lock_sched_held+0x3f/0x80 ? kmem_cache_alloc_trace+0x2c4/0x2f0 ? btrfs_get_64+0x5e/0x100 btrfs_recover_balance+0x2f0/0x340 open_ctree+0x1095/0x1726 btrfs_mount_root.cold+0x12/0xea ? rcu_read_lock_sched_held+0x3f/0x80 legacy_get_tree+0x30/0x50 vfs_get_tree+0x28/0xc0 vfs_kern_mount.part.0+0x71/0xb0 btrfs_mount+0x10d/0x380 ? __kmalloc_track_caller+0x2f2/0x320 legacy_get_tree+0x30/0x50 vfs_get_tree+0x28/0xc0 ? capable+0x3a/0x60 path_mount+0x433/0xc10 __x64_sys_mount+0xe3/0x120 do_syscall_64+0x33/0x40 entry_SYSCALL_64_after_hwframe+0x44/0xa9 This is straightforward to fix, simply release the path before we setup the balance_ctl. CC: stable@vger.kernel.org # 4.4+ Reviewed-by: Qu Wenruo <wqu@suse.com> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
|
|
|
49ecc679ab |
btrfs: do not double free backref nodes on error
Zygo reported the following KASAN splat:
BUG: KASAN: use-after-free in btrfs_backref_cleanup_node+0x18a/0x420
Read of size 8 at addr ffff888112402950 by task btrfs/28836
CPU: 0 PID: 28836 Comm: btrfs Tainted: G W 5.10.0-e35f27394290-for-next+ #23
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.12.0-1 04/01/2014
Call Trace:
dump_stack+0xbc/0xf9
? btrfs_backref_cleanup_node+0x18a/0x420
print_address_description.constprop.8+0x21/0x210
? record_print_text.cold.34+0x11/0x11
? btrfs_backref_cleanup_node+0x18a/0x420
? btrfs_backref_cleanup_node+0x18a/0x420
kasan_report.cold.10+0x20/0x37
? btrfs_backref_cleanup_node+0x18a/0x420
__asan_load8+0x69/0x90
btrfs_backref_cleanup_node+0x18a/0x420
btrfs_backref_release_cache+0x83/0x1b0
relocate_block_group+0x394/0x780
? merge_reloc_roots+0x4a0/0x4a0
btrfs_relocate_block_group+0x26e/0x4c0
btrfs_relocate_chunk+0x52/0x120
btrfs_balance+0xe2e/0x1900
? check_flags.part.50+0x6c/0x1e0
? btrfs_relocate_chunk+0x120/0x120
? kmem_cache_alloc_trace+0xa06/0xcb0
? _copy_from_user+0x83/0xc0
btrfs_ioctl_balance+0x3a7/0x460
btrfs_ioctl+0x24c8/0x4360
? __kasan_check_read+0x11/0x20
? check_chain_key+0x1f4/0x2f0
? __asan_loadN+0xf/0x20
? btrfs_ioctl_get_supported_features+0x30/0x30
? kvm_sched_clock_read+0x18/0x30
? check_chain_key+0x1f4/0x2f0
? lock_downgrade+0x3f0/0x3f0
? handle_mm_fault+0xad6/0x2150
? do_vfs_ioctl+0xfc/0x9d0
? ioctl_file_clone+0xe0/0xe0
? check_flags.part.50+0x6c/0x1e0
? check_flags.part.50+0x6c/0x1e0
? check_flags+0x26/0x30
? lock_is_held_type+0xc3/0xf0
? syscall_enter_from_user_mode+0x1b/0x60
? do_syscall_64+0x13/0x80
? rcu_read_lock_sched_held+0xa1/0xd0
? __kasan_check_read+0x11/0x20
? __fget_light+0xae/0x110
__x64_sys_ioctl+0xc3/0x100
do_syscall_64+0x37/0x80
entry_SYSCALL_64_after_hwframe+0x44/0xa9
RIP: 0033:0x7f4c4bdfe427
Allocated by task 28836:
kasan_save_stack+0x21/0x50
__kasan_kmalloc.constprop.18+0xbe/0xd0
kasan_kmalloc+0x9/0x10
kmem_cache_alloc_trace+0x410/0xcb0
btrfs_backref_alloc_node+0x46/0xf0
btrfs_backref_add_tree_node+0x60d/0x11d0
build_backref_tree+0xc5/0x700
relocate_tree_blocks+0x2be/0xb90
relocate_block_group+0x2eb/0x780
btrfs_relocate_block_group+0x26e/0x4c0
btrfs_relocate_chunk+0x52/0x120
btrfs_balance+0xe2e/0x1900
btrfs_ioctl_balance+0x3a7/0x460
btrfs_ioctl+0x24c8/0x4360
__x64_sys_ioctl+0xc3/0x100
do_syscall_64+0x37/0x80
entry_SYSCALL_64_after_hwframe+0x44/0xa9
Freed by task 28836:
kasan_save_stack+0x21/0x50
kasan_set_track+0x20/0x30
kasan_set_free_info+0x1f/0x30
__kasan_slab_free+0xf3/0x140
kasan_slab_free+0xe/0x10
kfree+0xde/0x200
btrfs_backref_error_cleanup+0x452/0x530
build_backref_tree+0x1a5/0x700
relocate_tree_blocks+0x2be/0xb90
relocate_block_group+0x2eb/0x780
btrfs_relocate_block_group+0x26e/0x4c0
btrfs_relocate_chunk+0x52/0x120
btrfs_balance+0xe2e/0x1900
btrfs_ioctl_balance+0x3a7/0x460
btrfs_ioctl+0x24c8/0x4360
__x64_sys_ioctl+0xc3/0x100
do_syscall_64+0x37/0x80
entry_SYSCALL_64_after_hwframe+0x44/0xa9
This occurred because we freed our backref node in
btrfs_backref_error_cleanup(), but then tried to free it again in
btrfs_backref_release_cache(). This is because
btrfs_backref_release_cache() will cycle through all of the
cache->leaves nodes and free them up. However
btrfs_backref_error_cleanup() freed the backref node with
btrfs_backref_free_node(), which simply kfree()d the backref node
without unlinking it from the cache. Change this to a
btrfs_backref_drop_node(), which does the appropriate cleanup and
removes the node from the cache->leaves list, so when we go to free the
remaining cache we don't trip over items we've already dropped.
Fixes:
|
|
|
|
18d3bff411 |
btrfs: don't get an EINTR during drop_snapshot for reloc
This was partially fixed by |
|
|
|
518837e650 |
btrfs: send: fix invalid clone operations when cloning from the same file and root
When an incremental send finds an extent that is shared, it checks which file extent items in the range refer to that extent, and for those it emits clone operations, while for others it emits regular write operations to avoid corruption at the destination (as described and fixed by commit |
|
|
|
14ff8e1970 |
btrfs: no need to run delayed refs after commit_fs_roots during commit
The inode number cache has been removed in this dev cycle, there's one
more leftover. We don't need to run the delayed refs again after
commit_fs_roots as stated in the comment, because btrfs_save_ino_cache
is no more since
|
|
|
|
6e68b9961f |
for-5.11-rc3-tag
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEE8rQSAMVO+zA4DBdWxWXV+ddtWDsFAl/8jD4ACgkQxWXV+ddt
WDteWQ//QcpD6STpLwAC+g6zJyJln7Au9lfQvawugvOJssbtdPkJQP3ZiK+Izwi/
/xagu6XMazJM+47acNJKDNntOqVkp+O6CxEbLU+rL/D288L3HEGxayZ2LL90wm6J
tbIebOE+BSVZ/5oe0jVdqZXwYvUtTiJ7PoFgrZPXJCnddSitZRD3tC4Wmi/Yo5+0
+7CW6PT3/s7KARwYXpgpMM5vi8qO2nfHfTUdRlSh59g7zC/TH7HiitL6roHzlX1k
g/aaKYLVcg62OPpw7ZXwde/qH8n1TR+H5WX6vBInqd/9jYcNkVGqijCgBeL1TJkN
Vx/b69ccODK2GNzuuYoo3k3XvSwZWsOTZp+k4y3EZ1cMONMo1snu/xglYsvSZvUL
lNCQlA9hIZNskRwEvkEea68/bQdiOl6xezgR9tajMlmz7oCsV/Cz/MJ+RfqaxdH3
bV6eTTex67lQfzAda+gN+zjBrFzQdmK700gKimdzF1XfcYmmCIdZVX8Gm/N6ldQN
LNRe8zYRaqrmRk9PQ355RqYDZmft/wLiUV6V0j74oV65WpPe2R4pULWdmPAGm6Oj
UWM+ZR3u9m8asg7ghKYgct2pxCS3+gLbDNXNcOSxYxthEEZB2JqkAMjtjCfwJilN
PXfuXaBKRmRck+AcYfbBrfJOljQ+zAJdTK/Rid40TwwpFCe/jjY=
=G3R4
-----END PGP SIGNATURE-----
Merge tag 'for-5.11-rc3-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux
Pull btrfs fixes from David Sterba:
"More material for stable trees.
- tree-checker: check item end overflow
- fix false warning during relocation regarding extent type
- fix inode flushing logic, caused notable performance regression
(since 5.10)
- debugging fixups:
- print correct offset for reloc tree key
- pass reliable fs_info pointer to error reporting helper"
* tag 'for-5.11-rc3-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
btrfs: shrink delalloc pages instead of full inodes
btrfs: reloc: fix wrong file extent type check to avoid false ENOENT
btrfs: tree-checker: check if chunk item end overflows
btrfs: prevent NULL pointer dereference in extent_io_tree_panic
btrfs: print the actual offset in btrfs_root_name
|
|
|
|
e076ab2a2c |
btrfs: shrink delalloc pages instead of full inodes
Commit |
|
|
|
50e31ef486 |
btrfs: reloc: fix wrong file extent type check to avoid false ENOENT
[BUG]
There are several bug reports about recent kernel unable to relocate
certain data block groups.
Sometimes the error just goes away, but there is one reporter who can
reproduce it reliably.
The dmesg would look like:
[438.260483] BTRFS info (device dm-10): balance: start -dvrange=34625344765952..34625344765953
[438.269018] BTRFS info (device dm-10): relocating block group 34625344765952 flags data|raid1
[450.439609] BTRFS info (device dm-10): found 167 extents, stage: move data extents
[463.501781] BTRFS info (device dm-10): balance: ended with status: -2
[CAUSE]
The ENOENT error is returned from the following call chain:
add_data_references()
|- delete_v1_space_cache();
|- if (!found)
return -ENOENT;
The variable @found is set to true if we find a data extent whose
disk bytenr matches parameter @data_bytes.
With extra debugging, the offending tree block looks like this:
leaf bytenr = 42676709441536, data_bytenr = 34626327621632
ctime 1567904822.739884119 (2019-09-08 03:07:02)
mtime 0.0 (1970-01-01 01:00:00)
otime 0.0 (1970-01-01 01:00:00)
item 27 key (51933 EXTENT_DATA 0) itemoff 9854 itemsize 53
generation 1517381 type 2 (prealloc)
prealloc data disk byte 34626327621632 nr 262144 <<<
prealloc data offset 0 nr 262144
item 28 key (52262 ROOT_ITEM 0) itemoff 9415 itemsize 439
generation 2618893 root_dirid 256 bytenr 42677048360960 level 3 refs 1
lastsnap 2618893 byte_limit 0 bytes_used 5557338112 flags 0x0(none)
uuid d0d4361f-d231-6d40-8901-fe506e4b2b53
Although item 27 has disk bytenr 34626327621632, which matches the
data_bytenr, its type is prealloc, not reg.
This makes the existing code skip that item, and return ENOENT.
[FIX]
The code is modified in commit
|
|
|
|
347fb0cfc9 |
btrfs: tree-checker: check if chunk item end overflows
While mounting a crafted image provided by user, kernel panics due to the invalid chunk item whose end is less than start. [66.387422] loop: module loaded [66.389773] loop0: detected capacity change from 262144 to 0 [66.427708] BTRFS: device fsid a62e00e8-e94e-4200-8217-12444de93c2e devid 1 transid 12 /dev/loop0 scanned by mount (613) [66.431061] BTRFS info (device loop0): disk space caching is enabled [66.431078] BTRFS info (device loop0): has skinny extents [66.437101] BTRFS error: insert state: end < start 29360127 37748736 [66.437136] ------------[ cut here ]------------ [66.437140] WARNING: CPU: 16 PID: 613 at fs/btrfs/extent_io.c:557 insert_state.cold+0x1a/0x46 [btrfs] [66.437369] CPU: 16 PID: 613 Comm: mount Tainted: G O 5.11.0-rc1-custom #45 [66.437374] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS ArchLinux 1.14.0-1 04/01/2014 [66.437378] RIP: 0010:insert_state.cold+0x1a/0x46 [btrfs] [66.437420] RSP: 0018:ffff93e5414c3908 EFLAGS: 00010286 [66.437427] RAX: 0000000000000000 RBX: 0000000001bfffff RCX: 0000000000000000 [66.437431] RDX: 0000000000000000 RSI: ffffffffb90d4660 RDI: 00000000ffffffff [66.437434] RBP: ffff93e5414c3938 R08: 0000000000000001 R09: 0000000000000001 [66.437438] R10: ffff93e5414c3658 R11: 0000000000000000 R12: ffff8ec782d72aa0 [66.437441] R13: ffff8ec78bc71628 R14: 0000000000000000 R15: 0000000002400000 [66.437447] FS: 00007f01386a8580(0000) GS:ffff8ec809000000(0000) knlGS:0000000000000000 [66.437451] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [66.437455] CR2: 00007f01382fa000 CR3: 0000000109a34000 CR4: 0000000000750ee0 [66.437460] PKRU: 55555554 [66.437464] Call Trace: [66.437475] set_extent_bit+0x652/0x740 [btrfs] [66.437539] set_extent_bits_nowait+0x1d/0x20 [btrfs] [66.437576] add_extent_mapping+0x1e0/0x2f0 [btrfs] [66.437621] read_one_chunk+0x33c/0x420 [btrfs] [66.437674] btrfs_read_chunk_tree+0x6a4/0x870 [btrfs] [66.437708] ? kvm_sched_clock_read+0x18/0x40 [66.437739] open_ctree+0xb32/0x1734 [btrfs] [66.437781] ? bdi_register_va+0x1b/0x20 [66.437788] ? super_setup_bdi_name+0x79/0xd0 [66.437810] btrfs_mount_root.cold+0x12/0xeb [btrfs] [66.437854] ? __kmalloc_track_caller+0x217/0x3b0 [66.437873] legacy_get_tree+0x34/0x60 [66.437880] vfs_get_tree+0x2d/0xc0 [66.437888] vfs_kern_mount.part.0+0x78/0xc0 [66.437897] vfs_kern_mount+0x13/0x20 [66.437902] btrfs_mount+0x11f/0x3c0 [btrfs] [66.437940] ? kfree+0x5ff/0x670 [66.437944] ? __kmalloc_track_caller+0x217/0x3b0 [66.437962] legacy_get_tree+0x34/0x60 [66.437974] vfs_get_tree+0x2d/0xc0 [66.437983] path_mount+0x48c/0xd30 [66.437998] __x64_sys_mount+0x108/0x140 [66.438011] do_syscall_64+0x38/0x50 [66.438018] entry_SYSCALL_64_after_hwframe+0x44/0xa9 [66.438023] RIP: 0033:0x7f0138827f6e [66.438033] RSP: 002b:00007ffecd79edf8 EFLAGS: 00000246 ORIG_RAX: 00000000000000a5 [66.438040] RAX: ffffffffffffffda RBX: 00007f013894c264 RCX: 00007f0138827f6e [66.438044] RDX: 00005593a4a41360 RSI: 00005593a4a33690 RDI: 00005593a4a3a6c0 [66.438047] RBP: 00005593a4a33440 R08: 0000000000000000 R09: 0000000000000001 [66.438050] R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000000 [66.438054] R13: 00005593a4a3a6c0 R14: 00005593a4a41360 R15: 00005593a4a33440 [66.438078] irq event stamp: 18169 [66.438082] hardirqs last enabled at (18175): [<ffffffffb81154bf>] console_unlock+0x4ff/0x5f0 [66.438088] hardirqs last disabled at (18180): [<ffffffffb8115427>] console_unlock+0x467/0x5f0 [66.438092] softirqs last enabled at (16910): [<ffffffffb8a00fe2>] asm_call_irq_on_stack+0x12/0x20 [66.438097] softirqs last disabled at (16905): [<ffffffffb8a00fe2>] asm_call_irq_on_stack+0x12/0x20 [66.438103] ---[ end trace e114b111db64298b ]--- [66.438107] BTRFS error: found node 12582912 29360127 on insert of 37748736 29360127 [66.438127] BTRFS critical: panic in extent_io_tree_panic:679: locking error: extent tree was modified by another thread while locked (errno=-17 Object already exists) [66.441069] ------------[ cut here ]------------ [66.441072] kernel BUG at fs/btrfs/extent_io.c:679! [66.442064] invalid opcode: 0000 [#1] PREEMPT SMP NOPTI [66.443018] CPU: 16 PID: 613 Comm: mount Tainted: G W O 5.11.0-rc1-custom #45 [66.444538] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS ArchLinux 1.14.0-1 04/01/2014 [66.446223] RIP: 0010:extent_io_tree_panic.isra.0+0x23/0x25 [btrfs] [66.450878] RSP: 0018:ffff93e5414c3948 EFLAGS: 00010246 [66.451840] RAX: 0000000000000000 RBX: 0000000001bfffff RCX: 0000000000000000 [66.453141] RDX: 0000000000000000 RSI: ffffffffb90d4660 RDI: 00000000ffffffff [66.454445] RBP: ffff93e5414c3948 R08: 0000000000000001 R09: 0000000000000001 [66.455743] R10: ffff93e5414c3658 R11: 0000000000000000 R12: ffff8ec782d728c0 [66.457055] R13: ffff8ec78bc71628 R14: ffff8ec782d72aa0 R15: 0000000002400000 [66.458356] FS: 00007f01386a8580(0000) GS:ffff8ec809000000(0000) knlGS:0000000000000000 [66.459841] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [66.460895] CR2: 00007f01382fa000 CR3: 0000000109a34000 CR4: 0000000000750ee0 [66.462196] PKRU: 55555554 [66.462692] Call Trace: [66.463139] set_extent_bit.cold+0x30/0x98 [btrfs] [66.464049] set_extent_bits_nowait+0x1d/0x20 [btrfs] [66.490466] add_extent_mapping+0x1e0/0x2f0 [btrfs] [66.514097] read_one_chunk+0x33c/0x420 [btrfs] [66.534976] btrfs_read_chunk_tree+0x6a4/0x870 [btrfs] [66.555718] ? kvm_sched_clock_read+0x18/0x40 [66.575758] open_ctree+0xb32/0x1734 [btrfs] [66.595272] ? bdi_register_va+0x1b/0x20 [66.614638] ? super_setup_bdi_name+0x79/0xd0 [66.633809] btrfs_mount_root.cold+0x12/0xeb [btrfs] [66.652938] ? __kmalloc_track_caller+0x217/0x3b0 [66.671925] legacy_get_tree+0x34/0x60 [66.690300] vfs_get_tree+0x2d/0xc0 [66.708221] vfs_kern_mount.part.0+0x78/0xc0 [66.725808] vfs_kern_mount+0x13/0x20 [66.742730] btrfs_mount+0x11f/0x3c0 [btrfs] [66.759350] ? kfree+0x5ff/0x670 [66.775441] ? __kmalloc_track_caller+0x217/0x3b0 [66.791750] legacy_get_tree+0x34/0x60 [66.807494] vfs_get_tree+0x2d/0xc0 [66.823349] path_mount+0x48c/0xd30 [66.838753] __x64_sys_mount+0x108/0x140 [66.854412] do_syscall_64+0x38/0x50 [66.869673] entry_SYSCALL_64_after_hwframe+0x44/0xa9 [66.885093] RIP: 0033:0x7f0138827f6e [66.945613] RSP: 002b:00007ffecd79edf8 EFLAGS: 00000246 ORIG_RAX: 00000000000000a5 [66.977214] RAX: ffffffffffffffda RBX: 00007f013894c264 RCX: 00007f0138827f6e [66.994266] RDX: 00005593a4a41360 RSI: 00005593a4a33690 RDI: 00005593a4a3a6c0 [67.011544] RBP: 00005593a4a33440 R08: 0000000000000000 R09: 0000000000000001 [67.028836] R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000000 [67.045812] R13: 00005593a4a3a6c0 R14: 00005593a4a41360 R15: 00005593a4a33440 [67.216138] ---[ end trace e114b111db64298c ]--- [67.237089] RIP: 0010:extent_io_tree_panic.isra.0+0x23/0x25 [btrfs] [67.325317] RSP: 0018:ffff93e5414c3948 EFLAGS: 00010246 [67.347946] RAX: 0000000000000000 RBX: 0000000001bfffff RCX: 0000000000000000 [67.371343] RDX: 0000000000000000 RSI: ffffffffb90d4660 RDI: 00000000ffffffff [67.394757] RBP: ffff93e5414c3948 R08: 0000000000000001 R09: 0000000000000001 [67.418409] R10: ffff93e5414c3658 R11: 0000000000000000 R12: ffff8ec782d728c0 [67.441906] R13: ffff8ec78bc71628 R14: ffff8ec782d72aa0 R15: 0000000002400000 [67.465436] FS: 00007f01386a8580(0000) GS:ffff8ec809000000(0000) knlGS:0000000000000000 [67.511660] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [67.535047] CR2: 00007f01382fa000 CR3: 0000000109a34000 CR4: 0000000000750ee0 [67.558449] PKRU: 55555554 [67.581146] note: mount[613] exited with preempt_count 2 The image has a chunk item which has a logical start 37748736 and length 18446744073701163008 (-8M). The calculated end 29360127 overflows. EEXIST was caught by insert_state() because of the duplicate end and extent_io_tree_panic() was called. Add overflow check of chunk item end to tree checker so it can be detected early at mount time. Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=208929 CC: stable@vger.kernel.org # 4.19+ Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: Su Yue <l@damenly.su> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
|
|
|
29b665cc51 |
btrfs: prevent NULL pointer dereference in extent_io_tree_panic
Some extent io trees are initialized with NULL private member (e.g.
btrfs_device::alloc_state and btrfs_fs_info::excluded_extents).
Dereference of a NULL tree->private as inode pointer will cause panic.
Pass tree->fs_info as it's known to be valid in all cases.
Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=208929
Fixes:
|
|
|
|
71008734d2 |
btrfs: print the actual offset in btrfs_root_name
We're supposed to print the root_key.offset in btrfs_root_name in the
case of a reloc root, not the objectid. Fix this helper to take the key
so we have access to the offset when we need it.
Fixes:
|
|
|
|
71c061d244 |
for-5.11-rc2-tag
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEE8rQSAMVO+zA4DBdWxWXV+ddtWDsFAl/0cI8ACgkQxWXV+ddt
WDspQw/8DcC8zhGgunk0m2kcXd6dFOGbsr3hNGCsgUSKESRw6AgTZ0rJf/QLjayF
/vaJWzQW9ijfZ92fWZS+mrmskk0N8RFOsEvkCRLesgRaasbrkchLBo5HGQasOBEV
LXyU878GrBkNaHzClJz+JdU26i0d17BFdddgtZVQ1St9Wd9ecc7Q6iqG80RWFeE7
uVbhv+QjocM3EieOnwIy5Mz6jZgJLYwqw7/y2njKduBeJtbt1K1j/y7IJk0WFMUM
8eUpDL6vlAHB8FjV2wWOzO46bbEaUpaBADM6yabrq0lnM0kr7Rb+WV/WSLM/AZ3g
Hzs4qROOEP+zjfZ5nYjJQDJRMpSipZomsUY5uMZnhRxlZuHPaoBotRRzs5AIZYj2
BnkfucOcjxS/JTBD//ltJXE8RxbMIyMBBBipbBwqmxOkR9gM9BPuJ6iJPfUX//gG
1GHJ+FPns8ua3JW21ih6H31xNEPS36tsywvE8yCEtEWMxCFCBwgGu+4D8KpGBjtY
ySFxkxxAbTuFi9fqSE/mBC+6lpbVTO0OvizuoEQh8C2izkXRbDsDVgPN8d7rCW7h
Cdox4DUp61sNf+G3ll9Dv9ceAXroZTVRTHGjlav6NAFpydz3yPo5x54Ex7S+k3oN
BAcZEl1Tl3hz4WxF8Ywc+yJ8n8l9AVa3KcYRXVbyVjTGg+JjU94=
=jlQf
-----END PGP SIGNATURE-----
Merge tag 'for-5.11-rc2-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux
Pull btrfs fixes from David Sterba:
"A few more fixes that arrived before the end of the year:
- a bunch of fixes related to transaction handle lifetime wrt various
operations (umount, remount, qgroup scan, orphan cleanup)
- async discard scheduling fixes
- fix item size calculation when item keys collide for extend refs
(hardlinks)
- fix qgroup flushing from running transaction
- fix send, wrong file path when there is an inode with a pending
rmdir
- fix deadlock when cloning inline extent and low on free metadata
space"
* tag 'for-5.11-rc2-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
btrfs: run delayed iputs when remounting RO to avoid leaking them
btrfs: add assertion for empty list of transactions at late stage of umount
btrfs: fix race between RO remount and the cleaner task
btrfs: fix transaction leak and crash after cleaning up orphans on RO mount
btrfs: fix transaction leak and crash after RO remount caused by qgroup rescan
btrfs: merge critical sections of discard lock in workfn
btrfs: fix racy access to discard_ctl data
btrfs: fix async discard stall
btrfs: tests: initialize test inodes location
btrfs: send: fix wrong file path when there is an inode with a pending rmdir
btrfs: qgroup: don't try to wait flushing if we're already holding a transaction
btrfs: correctly calculate item size used when item key collision happens
btrfs: fix deadlock when cloning inline extent and low on free metadata space
|
|
|
|
a8cc263eb5 |
btrfs: run delayed iputs when remounting RO to avoid leaking them
When remounting RO, after setting the superblock with the RO flag, the cleaner task will start sleeping and do nothing, since the call to btrfs_need_cleaner_sleep() keeps returning 'true'. However, when the cleaner task goes to sleep, the list of delayed iputs may not be empty. As long as we are in RO mode, the cleaner task will keep sleeping and never run the delayed iputs. This means that if a filesystem unmount is started, we get into close_ctree() with a non-empty list of delayed iputs, and because the filesystem is in RO mode and is not in an error state (or a transaction aborted), btrfs_error_commit_super() and btrfs_commit_super(), which run the delayed iputs, are never called, and later we fail the assertion that checks if the delayed iputs list is empty: assertion failed: list_empty(&fs_info->delayed_iputs), in fs/btrfs/disk-io.c:4049 ------------[ cut here ]------------ kernel BUG at fs/btrfs/ctree.h:3153! invalid opcode: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC PTI CPU: 1 PID: 3780621 Comm: umount Tainted: G L 5.6.0-rc2-btrfs-next-73 #1 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.12.0-0-ga698c8995f-prebuilt.qemu.org 04/01/2014 RIP: 0010:assertfail.constprop.0+0x18/0x26 [btrfs] Code: 8b 7b 58 48 85 ff 74 (...) RSP: 0018:ffffb748c89bbdf8 EFLAGS: 00010246 RAX: 0000000000000051 RBX: ffff9608f2584000 RCX: 0000000000000000 RDX: 0000000000000000 RSI: ffffffff91998988 RDI: 00000000ffffffff RBP: ffff9608f25870d8 R08: 0000000000000000 R09: 0000000000000001 R10: 0000000000000000 R11: 0000000000000000 R12: ffffffffc0cbc500 R13: ffffffff92411750 R14: 0000000000000000 R15: ffff9608f2aab250 FS: 00007fcbfaa66c80(0000) GS:ffff960936c80000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 00007fffc2c2dd38 CR3: 0000000235e54002 CR4: 00000000003606e0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 Call Trace: close_ctree+0x1a2/0x2e6 [btrfs] generic_shutdown_super+0x6c/0x100 kill_anon_super+0x14/0x30 btrfs_kill_super+0x12/0x20 [btrfs] deactivate_locked_super+0x31/0x70 cleanup_mnt+0x100/0x160 task_work_run+0x93/0xc0 exit_to_usermode_loop+0xf9/0x100 do_syscall_64+0x20d/0x260 entry_SYSCALL_64_after_hwframe+0x49/0xbe RIP: 0033:0x7fcbfaca6307 Code: eb 0b 00 f7 d8 64 89 (...) RSP: 002b:00007fffc2c2ed68 EFLAGS: 00000246 ORIG_RAX: 00000000000000a6 RAX: 0000000000000000 RBX: 0000558203b559b0 RCX: 00007fcbfaca6307 RDX: 0000000000000001 RSI: 0000000000000000 RDI: 0000558203b55bc0 RBP: 0000000000000000 R08: 0000000000000001 R09: 00007fffc2c2dad0 R10: 0000558203b55bf0 R11: 0000000000000246 R12: 0000558203b55bc0 R13: 00007fcbfadcc204 R14: 0000558203b55aa8 R15: 0000000000000000 Modules linked in: btrfs dm_flakey dm_log_writes (...) ---[ end trace d44d303790049ef6 ]--- So fix this by making the remount RO path run any remaining delayed iputs after waiting for the cleaner to become inactive. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
|
|
|
0a31daa4b6 |
btrfs: add assertion for empty list of transactions at late stage of umount
Add an assertion to close_ctree(), after destroying all the work queues, to verify we do not have any transaction still open or committing at that at that point. If we have any, it means something is seriously wrong and that can cause memory leaks and use-after-free problems. This is motivated by the previous patches that fixed bugs where we ended up leaking an open transaction after unmounting the filesystem. Tested-by: Fabian Vogt <fvogt@suse.com> Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
|
|
|
a0a1db70df |
btrfs: fix race between RO remount and the cleaner task
When we are remounting a filesystem in RO mode we can race with the cleaner
task and result in leaking a transaction if the filesystem is unmounted
shortly after, before the transaction kthread had a chance to commit that
transaction. That also results in a crash during unmount, due to a
use-after-free, if hardware acceleration is not available for crc32c.
The following sequence of steps explains how the race happens.
1) The filesystem is mounted in RW mode and the cleaner task is running.
This means that currently BTRFS_FS_CLEANER_RUNNING is set at
fs_info->flags;
2) The cleaner task is currently running delayed iputs for example;
3) A filesystem RO remount operation starts;
4) The RO remount task calls btrfs_commit_super(), which commits any
currently open transaction, and it finishes;
5) At this point the cleaner task is still running and it creates a new
transaction by doing one of the following things:
* When running the delayed iput() for an inode with a 0 link count,
in which case at btrfs_evict_inode() we start a transaction through
the call to evict_refill_and_join(), use it and then release its
handle through btrfs_end_transaction();
* When deleting a dead root through btrfs_clean_one_deleted_snapshot(),
a transaction is started at btrfs_drop_snapshot() and then its handle
is released through a call to btrfs_end_transaction_throttle();
* When the remount task was still running, and before the remount task
called btrfs_delete_unused_bgs(), the cleaner task also called
btrfs_delete_unused_bgs() and it picked and removed one block group
from the list of unused block groups. Before the cleaner task started
a transaction, through btrfs_start_trans_remove_block_group() at
btrfs_delete_unused_bgs(), the remount task had already called
btrfs_commit_super();
6) So at this point the filesystem is in RO mode and we have an open
transaction that was started by the cleaner task;
7) Shortly after a filesystem unmount operation starts. At close_ctree()
we stop the transaction kthread before it had a chance to commit the
transaction, since less than 30 seconds (the default commit interval)
have elapsed since the last transaction was committed;
8) We end up calling iput() against the btree inode at close_ctree() while
there is an open transaction, and since that transaction was used to
update btrees by the cleaner, we have dirty pages in the btree inode
due to COW operations on metadata extents, and therefore writeback is
triggered for the btree inode.
So btree_write_cache_pages() is invoked to flush those dirty pages
during the final iput() on the btree inode. This results in creating a
bio and submitting it, which makes us end up at
btrfs_submit_metadata_bio();
9) At btrfs_submit_metadata_bio() we end up at the if-then-else branch
that calls btrfs_wq_submit_bio(), because check_async_write() returned
a value of 1. This value of 1 is because we did not have hardware
acceleration available for crc32c, so BTRFS_FS_CSUM_IMPL_FAST was not
set in fs_info->flags;
10) Then at btrfs_wq_submit_bio() we call btrfs_queue_work() against the
workqueue at fs_info->workers, which was already freed before by the
call to btrfs_stop_all_workers() at close_ctree(). This results in an
invalid memory access due to a use-after-free, leading to a crash.
When this happens, before the crash there are several warnings triggered,
since we have reserved metadata space in a block group, the delayed refs
reservation, etc:
------------[ cut here ]------------
WARNING: CPU: 4 PID: 1729896 at fs/btrfs/block-group.c:125 btrfs_put_block_group+0x63/0xa0 [btrfs]
Modules linked in: btrfs dm_snapshot dm_thin_pool (...)
CPU: 4 PID: 1729896 Comm: umount Tainted: G B W 5.10.0-rc4-btrfs-next-73 #1
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.13.0-0-gf21b5a4aeb02-prebuilt.qemu.org 04/01/2014
RIP: 0010:btrfs_put_block_group+0x63/0xa0 [btrfs]
Code: f0 01 00 00 48 39 c2 75 (...)
RSP: 0018:ffffb270826bbdd8 EFLAGS: 00010206
RAX: 0000000000000001 RBX: ffff947ed73e4000 RCX: ffff947ebc8b29c8
RDX: 0000000000000001 RSI: ffffffffc0b150a0 RDI: ffff947ebc8b2800
RBP: ffff947ebc8b2800 R08: 0000000000000000 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000001 R12: ffff947ed73e4110
R13: ffff947ed73e4160 R14: ffff947ebc8b2988 R15: dead000000000100
FS: 00007f15edfea840(0000) GS:ffff9481ad600000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007f37e2893320 CR3: 0000000138f68001 CR4: 00000000003706e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Call Trace:
btrfs_free_block_groups+0x17f/0x2f0 [btrfs]
close_ctree+0x2ba/0x2fa [btrfs]
generic_shutdown_super+0x6c/0x100
kill_anon_super+0x14/0x30
btrfs_kill_super+0x12/0x20 [btrfs]
deactivate_locked_super+0x31/0x70
cleanup_mnt+0x100/0x160
task_work_run+0x68/0xb0
exit_to_user_mode_prepare+0x1bb/0x1c0
syscall_exit_to_user_mode+0x4b/0x260
entry_SYSCALL_64_after_hwframe+0x44/0xa9
RIP: 0033:0x7f15ee221ee7
Code: ff 0b 00 f7 d8 64 89 01 48 (...)
RSP: 002b:00007ffe9470f0f8 EFLAGS: 00000246 ORIG_RAX: 00000000000000a6
RAX: 0000000000000000 RBX: 00007f15ee347264 RCX: 00007f15ee221ee7
RDX: ffffffffffffff78 RSI: 0000000000000000 RDI: 000056169701d000
RBP: 0000561697018a30 R08: 0000000000000000 R09: 00007f15ee2e2be0
R10: 000056169701efe0 R11: 0000000000000246 R12: 0000000000000000
R13: 000056169701d000 R14: 0000561697018b40 R15: 0000561697018c60
irq event stamp: 0
hardirqs last enabled at (0): [<0000000000000000>] 0x0
hardirqs last disabled at (0): [<ffffffff8bcae560>] copy_process+0x8a0/0x1d70
softirqs last enabled at (0): [<ffffffff8bcae560>] copy_process+0x8a0/0x1d70
softirqs last disabled at (0): [<0000000000000000>] 0x0
---[ end trace dd74718fef1ed5c6 ]---
------------[ cut here ]------------
WARNING: CPU: 2 PID: 1729896 at fs/btrfs/block-rsv.c:459 btrfs_release_global_block_rsv+0x70/0xc0 [btrfs]
Modules linked in: btrfs dm_snapshot dm_thin_pool (...)
CPU: 2 PID: 1729896 Comm: umount Tainted: G B W 5.10.0-rc4-btrfs-next-73 #1
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.13.0-0-gf21b5a4aeb02-prebuilt.qemu.org 04/01/2014
RIP: 0010:btrfs_release_global_block_rsv+0x70/0xc0 [btrfs]
Code: 48 83 bb b0 03 00 00 00 (...)
RSP: 0018:ffffb270826bbdd8 EFLAGS: 00010206
RAX: 000000000033c000 RBX: ffff947ed73e4000 RCX: 0000000000000000
RDX: 0000000000000001 RSI: ffffffffc0b0d8c1 RDI: 00000000ffffffff
RBP: ffff947ebc8b7000 R08: 0000000000000001 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000001 R12: ffff947ed73e4110
R13: ffff947ed73e5278 R14: dead000000000122 R15: dead000000000100
FS: 00007f15edfea840(0000) GS:ffff9481aca00000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000561a79f76e20 CR3: 0000000138f68006 CR4: 00000000003706e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Call Trace:
btrfs_free_block_groups+0x24c/0x2f0 [btrfs]
close_ctree+0x2ba/0x2fa [btrfs]
generic_shutdown_super+0x6c/0x100
kill_anon_super+0x14/0x30
btrfs_kill_super+0x12/0x20 [btrfs]
deactivate_locked_super+0x31/0x70
cleanup_mnt+0x100/0x160
task_work_run+0x68/0xb0
exit_to_user_mode_prepare+0x1bb/0x1c0
syscall_exit_to_user_mode+0x4b/0x260
entry_SYSCALL_64_after_hwframe+0x44/0xa9
RIP: 0033:0x7f15ee221ee7
Code: ff 0b 00 f7 d8 64 89 01 (...)
RSP: 002b:00007ffe9470f0f8 EFLAGS: 00000246 ORIG_RAX: 00000000000000a6
RAX: 0000000000000000 RBX: 00007f15ee347264 RCX: 00007f15ee221ee7
RDX: ffffffffffffff78 RSI: 0000000000000000 RDI: 000056169701d000
RBP: 0000561697018a30 R08: 0000000000000000 R09: 00007f15ee2e2be0
R10: 000056169701efe0 R11: 0000000000000246 R12: 0000000000000000
R13: 000056169701d000 R14: 0000561697018b40 R15: 0000561697018c60
irq event stamp: 0
hardirqs last enabled at (0): [<0000000000000000>] 0x0
hardirqs last disabled at (0): [<ffffffff8bcae560>] copy_process+0x8a0/0x1d70
softirqs last enabled at (0): [<ffffffff8bcae560>] copy_process+0x8a0/0x1d70
softirqs last disabled at (0): [<0000000000000000>] 0x0
---[ end trace dd74718fef1ed5c7 ]---
------------[ cut here ]------------
WARNING: CPU: 2 PID: 1729896 at fs/btrfs/block-group.c:3377 btrfs_free_block_groups+0x25d/0x2f0 [btrfs]
Modules linked in: btrfs dm_snapshot dm_thin_pool (...)
CPU: 5 PID: 1729896 Comm: umount Tainted: G B W 5.10.0-rc4-btrfs-next-73 #1
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.13.0-0-gf21b5a4aeb02-prebuilt.qemu.org 04/01/2014
RIP: 0010:btrfs_free_block_groups+0x25d/0x2f0 [btrfs]
Code: ad de 49 be 22 01 00 (...)
RSP: 0018:ffffb270826bbde8 EFLAGS: 00010206
RAX: ffff947ebeae1d08 RBX: ffff947ed73e4000 RCX: 0000000000000000
RDX: 0000000000000001 RSI: ffff947e9d823ae8 RDI: 0000000000000246
RBP: ffff947ebeae1d08 R08: 0000000000000000 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000001 R12: ffff947ebeae1c00
R13: ffff947ed73e5278 R14: dead000000000122 R15: dead000000000100
FS: 00007f15edfea840(0000) GS:ffff9481ad200000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007f1475d98ea8 CR3: 0000000138f68005 CR4: 00000000003706e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Call Trace:
close_ctree+0x2ba/0x2fa [btrfs]
generic_shutdown_super+0x6c/0x100
kill_anon_super+0x14/0x30
btrfs_kill_super+0x12/0x20 [btrfs]
deactivate_locked_super+0x31/0x70
cleanup_mnt+0x100/0x160
task_work_run+0x68/0xb0
exit_to_user_mode_prepare+0x1bb/0x1c0
syscall_exit_to_user_mode+0x4b/0x260
entry_SYSCALL_64_after_hwframe+0x44/0xa9
RIP: 0033:0x7f15ee221ee7
Code: ff 0b 00 f7 d8 64 89 (...)
RSP: 002b:00007ffe9470f0f8 EFLAGS: 00000246 ORIG_RAX: 00000000000000a6
RAX: 0000000000000000 RBX: 00007f15ee347264 RCX: 00007f15ee221ee7
RDX: ffffffffffffff78 RSI: 0000000000000000 RDI: 000056169701d000
RBP: 0000561697018a30 R08: 0000000000000000 R09: 00007f15ee2e2be0
R10: 000056169701efe0 R11: 0000000000000246 R12: 0000000000000000
R13: 000056169701d000 R14: 0000561697018b40 R15: 0000561697018c60
irq event stamp: 0
hardirqs last enabled at (0): [<0000000000000000>] 0x0
hardirqs last disabled at (0): [<ffffffff8bcae560>] copy_process+0x8a0/0x1d70
softirqs last enabled at (0): [<ffffffff8bcae560>] copy_process+0x8a0/0x1d70
softirqs last disabled at (0): [<0000000000000000>] 0x0
---[ end trace dd74718fef1ed5c8 ]---
BTRFS info (device sdc): space_info 4 has 268238848 free, is not full
BTRFS info (device sdc): space_info total=268435456, used=114688, pinned=0, reserved=16384, may_use=0, readonly=65536
BTRFS info (device sdc): global_block_rsv: size 0 reserved 0
BTRFS info (device sdc): trans_block_rsv: size 0 reserved 0
BTRFS info (device sdc): chunk_block_rsv: size 0 reserved 0
BTRFS info (device sdc): delayed_block_rsv: size 0 reserved 0
BTRFS info (device sdc): delayed_refs_rsv: size 524288 reserved 0
And the crash, which only happens when we do not have crc32c hardware
acceleration, produces the following trace immediately after those
warnings:
stack segment: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC PTI
CPU: 2 PID: 1749129 Comm: umount Tainted: G B W 5.10.0-rc4-btrfs-next-73 #1
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.13.0-0-gf21b5a4aeb02-prebuilt.qemu.org 04/01/2014
RIP: 0010:btrfs_queue_work+0x36/0x190 [btrfs]
Code: 54 55 53 48 89 f3 (...)
RSP: 0018:ffffb27082443ae8 EFLAGS: 00010282
RAX: 0000000000000004 RBX: ffff94810ee9ad90 RCX: 0000000000000000
RDX: 0000000000000001 RSI: ffff94810ee9ad90 RDI: ffff947ed8ee75a0
RBP: a56b6b6b6b6b6b6b R08: 0000000000000000 R09: 0000000000000000
R10: 0000000000000007 R11: 0000000000000001 R12: ffff947fa9b435a8
R13: ffff94810ee9ad90 R14: 0000000000000000 R15: ffff947e93dc0000
FS: 00007f3cfe974840(0000) GS:ffff9481ac600000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007f1b42995a70 CR3: 0000000127638003 CR4: 00000000003706e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Call Trace:
btrfs_wq_submit_bio+0xb3/0xd0 [btrfs]
btrfs_submit_metadata_bio+0x44/0xc0 [btrfs]
submit_one_bio+0x61/0x70 [btrfs]
btree_write_cache_pages+0x414/0x450 [btrfs]
? kobject_put+0x9a/0x1d0
? trace_hardirqs_on+0x1b/0xf0
? _raw_spin_unlock_irqrestore+0x3c/0x60
? free_debug_processing+0x1e1/0x2b0
do_writepages+0x43/0xe0
? lock_acquired+0x199/0x490
__writeback_single_inode+0x59/0x650
writeback_single_inode+0xaf/0x120
write_inode_now+0x94/0xd0
iput+0x187/0x2b0
close_ctree+0x2c6/0x2fa [btrfs]
generic_shutdown_super+0x6c/0x100
kill_anon_super+0x14/0x30
btrfs_kill_super+0x12/0x20 [btrfs]
deactivate_locked_super+0x31/0x70
cleanup_mnt+0x100/0x160
task_work_run+0x68/0xb0
exit_to_user_mode_prepare+0x1bb/0x1c0
syscall_exit_to_user_mode+0x4b/0x260
entry_SYSCALL_64_after_hwframe+0x44/0xa9
RIP: 0033:0x7f3cfebabee7
Code: ff 0b 00 f7 d8 64 89 01 (...)
RSP: 002b:00007ffc9c9a05f8 EFLAGS: 00000246 ORIG_RAX: 00000000000000a6
RAX: 0000000000000000 RBX: 00007f3cfecd1264 RCX: 00007f3cfebabee7
RDX: ffffffffffffff78 RSI: 0000000000000000 RDI: 0000562b6b478000
RBP: 0000562b6b473a30 R08: 0000000000000000 R09: 00007f3cfec6cbe0
R10: 0000562b6b479fe0 R11: 0000000000000246 R12: 0000000000000000
R13: 0000562b6b478000 R14: 0000562b6b473b40 R15: 0000562b6b473c60
Modules linked in: btrfs dm_snapshot dm_thin_pool (...)
---[ end trace dd74718fef1ed5cc ]---
Finally when we remove the btrfs module (rmmod btrfs), there are several
warnings about objects that were allocated from our slabs but were never
freed, consequence of the transaction that was never committed and got
leaked:
=============================================================================
BUG btrfs_delayed_ref_head (Tainted: G B W ): Objects remaining in btrfs_delayed_ref_head on __kmem_cache_shutdown()
-----------------------------------------------------------------------------
INFO: Slab 0x0000000094c2ae56 objects=24 used=2 fp=0x000000002bfa2521 flags=0x17fffc000010200
CPU: 5 PID: 1729921 Comm: rmmod Tainted: G B W 5.10.0-rc4-btrfs-next-73 #1
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.13.0-0-gf21b5a4aeb02-prebuilt.qemu.org 04/01/2014
Call Trace:
dump_stack+0x8d/0xb5
slab_err+0xb7/0xdc
? lock_acquired+0x199/0x490
__kmem_cache_shutdown+0x1ac/0x3c0
? lock_release+0x20e/0x4c0
kmem_cache_destroy+0x55/0x120
btrfs_delayed_ref_exit+0x11/0x35 [btrfs]
exit_btrfs_fs+0xa/0x59 [btrfs]
__x64_sys_delete_module+0x194/0x260
? fpregs_assert_state_consistent+0x1e/0x40
? exit_to_user_mode_prepare+0x55/0x1c0
? trace_hardirqs_on+0x1b/0xf0
do_syscall_64+0x33/0x80
entry_SYSCALL_64_after_hwframe+0x44/0xa9
RIP: 0033:0x7f693e305897
Code: 73 01 c3 48 8b 0d f9 f5 (...)
RSP: 002b:00007ffcf73eb508 EFLAGS: 00000206 ORIG_RAX: 00000000000000b0
RAX: ffffffffffffffda RBX: 0000559df504f760 RCX: 00007f693e305897
RDX: 000000000000000a RSI: 0000000000000800 RDI: 0000559df504f7c8
RBP: 00007ffcf73eb568 R08: 0000000000000000 R09: 0000000000000000
R10: 00007f693e378ac0 R11: 0000000000000206 R12: 00007ffcf73eb740
R13: 00007ffcf73ec5a6 R14: 0000559df504f2a0 R15: 0000559df504f760
INFO: Object 0x0000000050cbdd61 @offset=12104
INFO: Allocated in btrfs_add_delayed_tree_ref+0xbb/0x480 [btrfs] age=1894 cpu=6 pid=1729873
__slab_alloc.isra.0+0x109/0x1c0
kmem_cache_alloc+0x7bb/0x830
btrfs_add_delayed_tree_ref+0xbb/0x480 [btrfs]
btrfs_free_tree_block+0x128/0x360 [btrfs]
__btrfs_cow_block+0x489/0x5f0 [btrfs]
btrfs_cow_block+0xf7/0x220 [btrfs]
btrfs_search_slot+0x62a/0xc40 [btrfs]
btrfs_del_orphan_item+0x65/0xd0 [btrfs]
btrfs_find_orphan_roots+0x1bf/0x200 [btrfs]
open_ctree+0x125a/0x18a0 [btrfs]
btrfs_mount_root.cold+0x13/0xed [btrfs]
legacy_get_tree+0x30/0x60
vfs_get_tree+0x28/0xe0
fc_mount+0xe/0x40
vfs_kern_mount.part.0+0x71/0x90
btrfs_mount+0x13b/0x3e0 [btrfs]
INFO: Freed in __btrfs_run_delayed_refs+0x1117/0x1290 [btrfs] age=4292 cpu=2 pid=1729526
kmem_cache_free+0x34c/0x3c0
__btrfs_run_delayed_refs+0x1117/0x1290 [btrfs]
btrfs_run_delayed_refs+0x81/0x210 [btrfs]
commit_cowonly_roots+0xfb/0x300 [btrfs]
btrfs_commit_transaction+0x367/0xc40 [btrfs]
sync_filesystem+0x74/0x90
generic_shutdown_super+0x22/0x100
kill_anon_super+0x14/0x30
btrfs_kill_super+0x12/0x20 [btrfs]
deactivate_locked_super+0x31/0x70
cleanup_mnt+0x100/0x160
task_work_run+0x68/0xb0
exit_to_user_mode_prepare+0x1bb/0x1c0
syscall_exit_to_user_mode+0x4b/0x260
entry_SYSCALL_64_after_hwframe+0x44/0xa9
INFO: Object 0x0000000086e9b0ff @offset=12776
INFO: Allocated in btrfs_add_delayed_tree_ref+0xbb/0x480 [btrfs] age=1900 cpu=6 pid=1729873
__slab_alloc.isra.0+0x109/0x1c0
kmem_cache_alloc+0x7bb/0x830
btrfs_add_delayed_tree_ref+0xbb/0x480 [btrfs]
btrfs_alloc_tree_block+0x2bf/0x360 [btrfs]
alloc_tree_block_no_bg_flush+0x4f/0x60 [btrfs]
__btrfs_cow_block+0x12d/0x5f0 [btrfs]
btrfs_cow_block+0xf7/0x220 [btrfs]
btrfs_search_slot+0x62a/0xc40 [btrfs]
btrfs_del_orphan_item+0x65/0xd0 [btrfs]
btrfs_find_orphan_roots+0x1bf/0x200 [btrfs]
open_ctree+0x125a/0x18a0 [btrfs]
btrfs_mount_root.cold+0x13/0xed [btrfs]
legacy_get_tree+0x30/0x60
vfs_get_tree+0x28/0xe0
fc_mount+0xe/0x40
vfs_kern_mount.part.0+0x71/0x90
INFO: Freed in __btrfs_run_delayed_refs+0x1117/0x1290 [btrfs] age=3141 cpu=6 pid=1729803
kmem_cache_free+0x34c/0x3c0
__btrfs_run_delayed_refs+0x1117/0x1290 [btrfs]
btrfs_run_delayed_refs+0x81/0x210 [btrfs]
btrfs_write_dirty_block_groups+0x17d/0x3d0 [btrfs]
commit_cowonly_roots+0x248/0x300 [btrfs]
btrfs_commit_transaction+0x367/0xc40 [btrfs]
close_ctree+0x113/0x2fa [btrfs]
generic_shutdown_super+0x6c/0x100
kill_anon_super+0x14/0x30
btrfs_kill_super+0x12/0x20 [btrfs]
deactivate_locked_super+0x31/0x70
cleanup_mnt+0x100/0x160
task_work_run+0x68/0xb0
exit_to_user_mode_prepare+0x1bb/0x1c0
syscall_exit_to_user_mode+0x4b/0x260
entry_SYSCALL_64_after_hwframe+0x44/0xa9
kmem_cache_destroy btrfs_delayed_ref_head: Slab cache still has objects
CPU: 5 PID: 1729921 Comm: rmmod Tainted: G B W 5.10.0-rc4-btrfs-next-73 #1
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.13.0-0-gf21b5a4aeb02-prebuilt.qemu.org 04/01/2014
Call Trace:
dump_stack+0x8d/0xb5
kmem_cache_destroy+0x119/0x120
btrfs_delayed_ref_exit+0x11/0x35 [btrfs]
exit_btrfs_fs+0xa/0x59 [btrfs]
__x64_sys_delete_module+0x194/0x260
? fpregs_assert_state_consistent+0x1e/0x40
? exit_to_user_mode_prepare+0x55/0x1c0
? trace_hardirqs_on+0x1b/0xf0
do_syscall_64+0x33/0x80
entry_SYSCALL_64_after_hwframe+0x44/0xa9
RIP: 0033:0x7f693e305897
Code: 73 01 c3 48 8b 0d f9 f5 0b (...)
RSP: 002b:00007ffcf73eb508 EFLAGS: 00000206 ORIG_RAX: 00000000000000b0
RAX: ffffffffffffffda RBX: 0000559df504f760 RCX: 00007f693e305897
RDX: 000000000000000a RSI: 0000000000000800 RDI: 0000559df504f7c8
RBP: 00007ffcf73eb568 R08: 0000000000000000 R09: 0000000000000000
R10: 00007f693e378ac0 R11: 0000000000000206 R12: 00007ffcf73eb740
R13: 00007ffcf73ec5a6 R14: 0000559df504f2a0 R15: 0000559df504f760
=============================================================================
BUG btrfs_delayed_tree_ref (Tainted: G B W ): Objects remaining in btrfs_delayed_tree_ref on __kmem_cache_shutdown()
-----------------------------------------------------------------------------
INFO: Slab 0x0000000011f78dc0 objects=37 used=2 fp=0x0000000032d55d91 flags=0x17fffc000010200
CPU: 3 PID: 1729921 Comm: rmmod Tainted: G B W 5.10.0-rc4-btrfs-next-73 #1
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.13.0-0-gf21b5a4aeb02-prebuilt.qemu.org 04/01/2014
Call Trace:
dump_stack+0x8d/0xb5
slab_err+0xb7/0xdc
? lock_acquired+0x199/0x490
__kmem_cache_shutdown+0x1ac/0x3c0
? lock_release+0x20e/0x4c0
kmem_cache_destroy+0x55/0x120
btrfs_delayed_ref_exit+0x1d/0x35 [btrfs]
exit_btrfs_fs+0xa/0x59 [btrfs]
__x64_sys_delete_module+0x194/0x260
? fpregs_assert_state_consistent+0x1e/0x40
? exit_to_user_mode_prepare+0x55/0x1c0
? trace_hardirqs_on+0x1b/0xf0
do_syscall_64+0x33/0x80
entry_SYSCALL_64_after_hwframe+0x44/0xa9
RIP: 0033:0x7f693e305897
Code: 73 01 c3 48 8b 0d f9 f5 (...)
RSP: 002b:00007ffcf73eb508 EFLAGS: 00000206 ORIG_RAX: 00000000000000b0
RAX: ffffffffffffffda RBX: 0000559df504f760 RCX: 00007f693e305897
RDX: 000000000000000a RSI: 0000000000000800 RDI: 0000559df504f7c8
RBP: 00007ffcf73eb568 R08: 0000000000000000 R09: 0000000000000000
R10: 00007f693e378ac0 R11: 0000000000000206 R12: 00007ffcf73eb740
R13: 00007ffcf73ec5a6 R14: 0000559df504f2a0 R15: 0000559df504f760
INFO: Object 0x000000001a340018 @offset=4408
INFO: Allocated in btrfs_add_delayed_tree_ref+0x9e/0x480 [btrfs] age=1917 cpu=6 pid=1729873
__slab_alloc.isra.0+0x109/0x1c0
kmem_cache_alloc+0x7bb/0x830
btrfs_add_delayed_tree_ref+0x9e/0x480 [btrfs]
btrfs_free_tree_block+0x128/0x360 [btrfs]
__btrfs_cow_block+0x489/0x5f0 [btrfs]
btrfs_cow_block+0xf7/0x220 [btrfs]
btrfs_search_slot+0x62a/0xc40 [btrfs]
btrfs_del_orphan_item+0x65/0xd0 [btrfs]
btrfs_find_orphan_roots+0x1bf/0x200 [btrfs]
open_ctree+0x125a/0x18a0 [btrfs]
btrfs_mount_root.cold+0x13/0xed [btrfs]
legacy_get_tree+0x30/0x60
vfs_get_tree+0x28/0xe0
fc_mount+0xe/0x40
vfs_kern_mount.part.0+0x71/0x90
btrfs_mount+0x13b/0x3e0 [btrfs]
INFO: Freed in __btrfs_run_delayed_refs+0x63d/0x1290 [btrfs] age=4167 cpu=4 pid=1729795
kmem_cache_free+0x34c/0x3c0
__btrfs_run_delayed_refs+0x63d/0x1290 [btrfs]
btrfs_run_delayed_refs+0x81/0x210 [btrfs]
btrfs_commit_transaction+0x60/0xc40 [btrfs]
create_subvol+0x56a/0x990 [btrfs]
btrfs_mksubvol+0x3fb/0x4a0 [btrfs]
__btrfs_ioctl_snap_create+0x119/0x1a0 [btrfs]
btrfs_ioctl_snap_create+0x58/0x80 [btrfs]
btrfs_ioctl+0x1a92/0x36f0 [btrfs]
__x64_sys_ioctl+0x83/0xb0
do_syscall_64+0x33/0x80
entry_SYSCALL_64_after_hwframe+0x44/0xa9
INFO: Object 0x000000002b46292a @offset=13648
INFO: Allocated in btrfs_add_delayed_tree_ref+0x9e/0x480 [btrfs] age=1923 cpu=6 pid=1729873
__slab_alloc.isra.0+0x109/0x1c0
kmem_cache_alloc+0x7bb/0x830
btrfs_add_delayed_tree_ref+0x9e/0x480 [btrfs]
btrfs_alloc_tree_block+0x2bf/0x360 [btrfs]
alloc_tree_block_no_bg_flush+0x4f/0x60 [btrfs]
__btrfs_cow_block+0x12d/0x5f0 [btrfs]
btrfs_cow_block+0xf7/0x220 [btrfs]
btrfs_search_slot+0x62a/0xc40 [btrfs]
btrfs_del_orphan_item+0x65/0xd0 [btrfs]
btrfs_find_orphan_roots+0x1bf/0x200 [btrfs]
open_ctree+0x125a/0x18a0 [btrfs]
btrfs_mount_root.cold+0x13/0xed [btrfs]
legacy_get_tree+0x30/0x60
vfs_get_tree+0x28/0xe0
fc_mount+0xe/0x40
vfs_kern_mount.part.0+0x71/0x90
INFO: Freed in __btrfs_run_delayed_refs+0x63d/0x1290 [btrfs] age=3164 cpu=6 pid=1729803
kmem_cache_free+0x34c/0x3c0
__btrfs_run_delayed_refs+0x63d/0x1290 [btrfs]
btrfs_run_delayed_refs+0x81/0x210 [btrfs]
commit_cowonly_roots+0xfb/0x300 [btrfs]
btrfs_commit_transaction+0x367/0xc40 [btrfs]
close_ctree+0x113/0x2fa [btrfs]
generic_shutdown_super+0x6c/0x100
kill_anon_super+0x14/0x30
btrfs_kill_super+0x12/0x20 [btrfs]
deactivate_locked_super+0x31/0x70
cleanup_mnt+0x100/0x160
task_work_run+0x68/0xb0
exit_to_user_mode_prepare+0x1bb/0x1c0
syscall_exit_to_user_mode+0x4b/0x260
entry_SYSCALL_64_after_hwframe+0x44/0xa9
kmem_cache_destroy btrfs_delayed_tree_ref: Slab cache still has objects
CPU: 5 PID: 1729921 Comm: rmmod Tainted: G B W 5.10.0-rc4-btrfs-next-73 #1
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.13.0-0-gf21b5a4aeb02-prebuilt.qemu.org 04/01/2014
Call Trace:
dump_stack+0x8d/0xb5
kmem_cache_destroy+0x119/0x120
btrfs_delayed_ref_exit+0x1d/0x35 [btrfs]
exit_btrfs_fs+0xa/0x59 [btrfs]
__x64_sys_delete_module+0x194/0x260
? fpregs_assert_state_consistent+0x1e/0x40
? exit_to_user_mode_prepare+0x55/0x1c0
? trace_hardirqs_on+0x1b/0xf0
do_syscall_64+0x33/0x80
entry_SYSCALL_64_after_hwframe+0x44/0xa9
RIP: 0033:0x7f693e305897
Code: 73 01 c3 48 8b 0d f9 f5 (...)
RSP: 002b:00007ffcf73eb508 EFLAGS: 00000206 ORIG_RAX: 00000000000000b0
RAX: ffffffffffffffda RBX: 0000559df504f760 RCX: 00007f693e305897
RDX: 000000000000000a RSI: 0000000000000800 RDI: 0000559df504f7c8
RBP: 00007ffcf73eb568 R08: 0000000000000000 R09: 0000000000000000
R10: 00007f693e378ac0 R11: 0000000000000206 R12: 00007ffcf73eb740
R13: 00007ffcf73ec5a6 R14: 0000559df504f2a0 R15: 0000559df504f760
=============================================================================
BUG btrfs_delayed_extent_op (Tainted: G B W ): Objects remaining in btrfs_delayed_extent_op on __kmem_cache_shutdown()
-----------------------------------------------------------------------------
INFO: Slab 0x00000000f145ce2f objects=22 used=1 fp=0x00000000af0f92cf flags=0x17fffc000010200
CPU: 5 PID: 1729921 Comm: rmmod Tainted: G B W 5.10.0-rc4-btrfs-next-73 #1
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.13.0-0-gf21b5a4aeb02-prebuilt.qemu.org 04/01/2014
Call Trace:
dump_stack+0x8d/0xb5
slab_err+0xb7/0xdc
? lock_acquired+0x199/0x490
__kmem_cache_shutdown+0x1ac/0x3c0
? __mutex_unlock_slowpath+0x45/0x2a0
kmem_cache_destroy+0x55/0x120
exit_btrfs_fs+0xa/0x59 [btrfs]
__x64_sys_delete_module+0x194/0x260
? fpregs_assert_state_consistent+0x1e/0x40
? exit_to_user_mode_prepare+0x55/0x1c0
? trace_hardirqs_on+0x1b/0xf0
do_syscall_64+0x33/0x80
entry_SYSCALL_64_after_hwframe+0x44/0xa9
RIP: 0033:0x7f693e305897
Code: 73 01 c3 48 8b 0d f9 f5 (...)
RSP: 002b:00007ffcf73eb508 EFLAGS: 00000206 ORIG_RAX: 00000000000000b0
RAX: ffffffffffffffda RBX: 0000559df504f760 RCX: 00007f693e305897
RDX: 000000000000000a RSI: 0000000000000800 RDI: 0000559df504f7c8
RBP: 00007ffcf73eb568 R08: 0000000000000000 R09: 0000000000000000
R10: 00007f693e378ac0 R11: 0000000000000206 R12: 00007ffcf73eb740
R13: 00007ffcf73ec5a6 R14: 0000559df504f2a0 R15: 0000559df504f760
INFO: Object 0x000000004cf95ea8 @offset=6264
INFO: Allocated in btrfs_alloc_tree_block+0x1e0/0x360 [btrfs] age=1931 cpu=6 pid=1729873
__slab_alloc.isra.0+0x109/0x1c0
kmem_cache_alloc+0x7bb/0x830
btrfs_alloc_tree_block+0x1e0/0x360 [btrfs]
alloc_tree_block_no_bg_flush+0x4f/0x60 [btrfs]
__btrfs_cow_block+0x12d/0x5f0 [btrfs]
btrfs_cow_block+0xf7/0x220 [btrfs]
btrfs_search_slot+0x62a/0xc40 [btrfs]
btrfs_del_orphan_item+0x65/0xd0 [btrfs]
btrfs_find_orphan_roots+0x1bf/0x200 [btrfs]
open_ctree+0x125a/0x18a0 [btrfs]
btrfs_mount_root.cold+0x13/0xed [btrfs]
legacy_get_tree+0x30/0x60
vfs_get_tree+0x28/0xe0
fc_mount+0xe/0x40
vfs_kern_mount.part.0+0x71/0x90
btrfs_mount+0x13b/0x3e0 [btrfs]
INFO: Freed in __btrfs_run_delayed_refs+0xabd/0x1290 [btrfs] age=3173 cpu=6 pid=1729803
kmem_cache_free+0x34c/0x3c0
__btrfs_run_delayed_refs+0xabd/0x1290 [btrfs]
btrfs_run_delayed_refs+0x81/0x210 [btrfs]
commit_cowonly_roots+0xfb/0x300 [btrfs]
btrfs_commit_transaction+0x367/0xc40 [btrfs]
close_ctree+0x113/0x2fa [btrfs]
generic_shutdown_super+0x6c/0x100
kill_anon_super+0x14/0x30
btrfs_kill_super+0x12/0x20 [btrfs]
deactivate_locked_super+0x31/0x70
cleanup_mnt+0x100/0x160
task_work_run+0x68/0xb0
exit_to_user_mode_prepare+0x1bb/0x1c0
syscall_exit_to_user_mode+0x4b/0x260
entry_SYSCALL_64_after_hwframe+0x44/0xa9
kmem_cache_destroy btrfs_delayed_extent_op: Slab cache still has objects
CPU: 3 PID: 1729921 Comm: rmmod Tainted: G B W 5.10.0-rc4-btrfs-next-73 #1
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.13.0-0-gf21b5a4aeb02-prebuilt.qemu.org 04/01/2014
Call Trace:
dump_stack+0x8d/0xb5
kmem_cache_destroy+0x119/0x120
exit_btrfs_fs+0xa/0x59 [btrfs]
__x64_sys_delete_module+0x194/0x260
? fpregs_assert_state_consistent+0x1e/0x40
? exit_to_user_mode_prepare+0x55/0x1c0
? trace_hardirqs_on+0x1b/0xf0
do_syscall_64+0x33/0x80
entry_SYSCALL_64_after_hwframe+0x44/0xa9
RIP: 0033:0x7f693e305897
Code: 73 01 c3 48 8b 0d f9 (...)
RSP: 002b:00007ffcf73eb508 EFLAGS: 00000206 ORIG_RAX: 00000000000000b0
RAX: ffffffffffffffda RBX: 0000559df504f760 RCX: 00007f693e305897
RDX: 000000000000000a RSI: 0000000000000800 RDI: 0000559df504f7c8
RBP: 00007ffcf73eb568 R08: 0000000000000000 R09: 0000000000000000
R10: 00007f693e378ac0 R11: 0000000000000206 R12: 00007ffcf73eb740
R13: 00007ffcf73ec5a6 R14: 0000559df504f2a0 R15: 0000559df504f760
BTRFS: state leak: start 30408704 end 30425087 state 1 in tree 1 refs 1
So fix this by making the remount path to wait for the cleaner task before
calling btrfs_commit_super(). The remount path now waits for the bit
BTRFS_FS_CLEANER_RUNNING to be cleared from fs_info->flags before calling
btrfs_commit_super() and this ensures the cleaner can not start a
transaction after that, because it sleeps when the filesystem is in RO
mode and we have already flagged the filesystem as RO before waiting for
BTRFS_FS_CLEANER_RUNNING to be cleared.
This also introduces a new flag BTRFS_FS_STATE_RO to be used for
fs_info->fs_state when the filesystem is in RO mode. This is because we
were doing the RO check using the flags of the superblock and setting the
RO mode simply by ORing into the superblock's flags - those operations are
not atomic and could result in the cleaner not seeing the update from the
remount task after it clears BTRFS_FS_CLEANER_RUNNING.
Tested-by: Fabian Vogt <fvogt@suse.com>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
|
|
638331fa56 |
btrfs: fix transaction leak and crash after cleaning up orphans on RO mount
When we delete a root (subvolume or snapshot), at the very end of the
operation, we attempt to remove the root's orphan item from the root tree,
at btrfs_drop_snapshot(), by calling btrfs_del_orphan_item(). We ignore any
error from btrfs_del_orphan_item() since it is not a serious problem and
the next time the filesystem is mounted we remove such stray orphan items
at btrfs_find_orphan_roots().
However if the filesystem is mounted RO and we have stray orphan items for
any previously deleted root, we can end up leaking a transaction and other
data structures when unmounting the filesystem, as well as crashing if we
do not have hardware acceleration for crc32c available.
The steps that lead to the transaction leak are the following:
1) The filesystem is mounted in RW mode;
2) A subvolume is deleted;
3) When the cleaner kthread runs btrfs_drop_snapshot() to delete the root,
it gets a failure at btrfs_del_orphan_item(), which is ignored, due to
an ENOMEM when allocating a path for example. So the orphan item for
the root remains in the root tree;
4) The filesystem is unmounted;
5) The filesystem is mounted RO (-o ro). During the mount path we call
btrfs_find_orphan_roots(), which iterates the root tree searching for
orphan items. It finds the orphan item for our deleted root, and since
it can not find the root, it starts a transaction to delete the orphan
item (by calling btrfs_del_orphan_item());
6) The RO mount completes;
7) Before the transaction kthread commits the transaction created for
deleting the orphan item (i.e. less than 30 seconds elapsed since the
mount, the default commit interval), a filesystem unmount operation is
started;
8) At close_ctree(), we stop the transaction kthread, but we still have a
transaction open with at least one dirty extent buffer, a leaf for the
tree root which was COWed when deleting the orphan item;
9) We then proceed to destroy the work queues, free the roots and block
groups, etc. After that we drop the last reference on the btree inode by
calling iput() on it. Since there are dirty pages for the btree inode,
corresponding to the COWed extent buffer, btree_write_cache_pages() is
invoked to flush those dirty pages. This results in creating a bio and
submitting it, which makes us end up at btrfs_submit_metadata_bio();
10) At btrfs_submit_metadata_bio() we end up at the if-then-else branch
that calls btrfs_wq_submit_bio(), because check_async_write() returned
a value of 1. This value of 1 is because we did not have hardware
acceleration available for crc32c, so BTRFS_FS_CSUM_IMPL_FAST was not
set in fs_info->flags;
11) Then at btrfs_wq_submit_bio() we call btrfs_queue_work() against the
workqueue at fs_info->workers, which was already freed before by the
call to btrfs_stop_all_workers() at close_ctree(). This results in an
invalid memory access due to a use-after-free, leading to a crash.
When this happens, before the crash there are several warnings triggered,
since we have reserved metadata space in a block group, the delayed refs
reservation, etc:
------------[ cut here ]------------
WARNING: CPU: 4 PID: 1729896 at fs/btrfs/block-group.c:125 btrfs_put_block_group+0x63/0xa0 [btrfs]
Modules linked in: btrfs dm_snapshot dm_thin_pool (...)
CPU: 4 PID: 1729896 Comm: umount Tainted: G B W 5.10.0-rc4-btrfs-next-73 #1
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.13.0-0-gf21b5a4aeb02-prebuilt.qemu.org 04/01/2014
RIP: 0010:btrfs_put_block_group+0x63/0xa0 [btrfs]
Code: f0 01 00 00 48 39 c2 75 (...)
RSP: 0018:ffffb270826bbdd8 EFLAGS: 00010206
RAX: 0000000000000001 RBX: ffff947ed73e4000 RCX: ffff947ebc8b29c8
RDX: 0000000000000001 RSI: ffffffffc0b150a0 RDI: ffff947ebc8b2800
RBP: ffff947ebc8b2800 R08: 0000000000000000 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000001 R12: ffff947ed73e4110
R13: ffff947ed73e4160 R14: ffff947ebc8b2988 R15: dead000000000100
FS: 00007f15edfea840(0000) GS:ffff9481ad600000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007f37e2893320 CR3: 0000000138f68001 CR4: 00000000003706e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Call Trace:
btrfs_free_block_groups+0x17f/0x2f0 [btrfs]
close_ctree+0x2ba/0x2fa [btrfs]
generic_shutdown_super+0x6c/0x100
kill_anon_super+0x14/0x30
btrfs_kill_super+0x12/0x20 [btrfs]
deactivate_locked_super+0x31/0x70
cleanup_mnt+0x100/0x160
task_work_run+0x68/0xb0
exit_to_user_mode_prepare+0x1bb/0x1c0
syscall_exit_to_user_mode+0x4b/0x260
entry_SYSCALL_64_after_hwframe+0x44/0xa9
RIP: 0033:0x7f15ee221ee7
Code: ff 0b 00 f7 d8 64 89 01 48 (...)
RSP: 002b:00007ffe9470f0f8 EFLAGS: 00000246 ORIG_RAX: 00000000000000a6
RAX: 0000000000000000 RBX: 00007f15ee347264 RCX: 00007f15ee221ee7
RDX: ffffffffffffff78 RSI: 0000000000000000 RDI: 000056169701d000
RBP: 0000561697018a30 R08: 0000000000000000 R09: 00007f15ee2e2be0
R10: 000056169701efe0 R11: 0000000000000246 R12: 0000000000000000
R13: 000056169701d000 R14: 0000561697018b40 R15: 0000561697018c60
irq event stamp: 0
hardirqs last enabled at (0): [<0000000000000000>] 0x0
hardirqs last disabled at (0): [<ffffffff8bcae560>] copy_process+0x8a0/0x1d70
softirqs last enabled at (0): [<ffffffff8bcae560>] copy_process+0x8a0/0x1d70
softirqs last disabled at (0): [<0000000000000000>] 0x0
---[ end trace dd74718fef1ed5c6 ]---
------------[ cut here ]------------
WARNING: CPU: 2 PID: 1729896 at fs/btrfs/block-rsv.c:459 btrfs_release_global_block_rsv+0x70/0xc0 [btrfs]
Modules linked in: btrfs dm_snapshot dm_thin_pool (...)
CPU: 2 PID: 1729896 Comm: umount Tainted: G B W 5.10.0-rc4-btrfs-next-73 #1
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.13.0-0-gf21b5a4aeb02-prebuilt.qemu.org 04/01/2014
RIP: 0010:btrfs_release_global_block_rsv+0x70/0xc0 [btrfs]
Code: 48 83 bb b0 03 00 00 00 (...)
RSP: 0018:ffffb270826bbdd8 EFLAGS: 00010206
RAX: 000000000033c000 RBX: ffff947ed73e4000 RCX: 0000000000000000
RDX: 0000000000000001 RSI: ffffffffc0b0d8c1 RDI: 00000000ffffffff
RBP: ffff947ebc8b7000 R08: 0000000000000001 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000001 R12: ffff947ed73e4110
R13: ffff947ed73e5278 R14: dead000000000122 R15: dead000000000100
FS: 00007f15edfea840(0000) GS:ffff9481aca00000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000561a79f76e20 CR3: 0000000138f68006 CR4: 00000000003706e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Call Trace:
btrfs_free_block_groups+0x24c/0x2f0 [btrfs]
close_ctree+0x2ba/0x2fa [btrfs]
generic_shutdown_super+0x6c/0x100
kill_anon_super+0x14/0x30
btrfs_kill_super+0x12/0x20 [btrfs]
deactivate_locked_super+0x31/0x70
cleanup_mnt+0x100/0x160
task_work_run+0x68/0xb0
exit_to_user_mode_prepare+0x1bb/0x1c0
syscall_exit_to_user_mode+0x4b/0x260
entry_SYSCALL_64_after_hwframe+0x44/0xa9
RIP: 0033:0x7f15ee221ee7
Code: ff 0b 00 f7 d8 64 89 01 (...)
RSP: 002b:00007ffe9470f0f8 EFLAGS: 00000246 ORIG_RAX: 00000000000000a6
RAX: 0000000000000000 RBX: 00007f15ee347264 RCX: 00007f15ee221ee7
RDX: ffffffffffffff78 RSI: 0000000000000000 RDI: 000056169701d000
RBP: 0000561697018a30 R08: 0000000000000000 R09: 00007f15ee2e2be0
R10: 000056169701efe0 R11: 0000000000000246 R12: 0000000000000000
R13: 000056169701d000 R14: 0000561697018b40 R15: 0000561697018c60
irq event stamp: 0
hardirqs last enabled at (0): [<0000000000000000>] 0x0
hardirqs last disabled at (0): [<ffffffff8bcae560>] copy_process+0x8a0/0x1d70
softirqs last enabled at (0): [<ffffffff8bcae560>] copy_process+0x8a0/0x1d70
softirqs last disabled at (0): [<0000000000000000>] 0x0
---[ end trace dd74718fef1ed5c7 ]---
------------[ cut here ]------------
WARNING: CPU: 2 PID: 1729896 at fs/btrfs/block-group.c:3377 btrfs_free_block_groups+0x25d/0x2f0 [btrfs]
Modules linked in: btrfs dm_snapshot dm_thin_pool (...)
CPU: 5 PID: 1729896 Comm: umount Tainted: G B W 5.10.0-rc4-btrfs-next-73 #1
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.13.0-0-gf21b5a4aeb02-prebuilt.qemu.org 04/01/2014
RIP: 0010:btrfs_free_block_groups+0x25d/0x2f0 [btrfs]
Code: ad de 49 be 22 01 00 (...)
RSP: 0018:ffffb270826bbde8 EFLAGS: 00010206
RAX: ffff947ebeae1d08 RBX: ffff947ed73e4000 RCX: 0000000000000000
RDX: 0000000000000001 RSI: ffff947e9d823ae8 RDI: 0000000000000246
RBP: ffff947ebeae1d08 R08: 0000000000000000 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000001 R12: ffff947ebeae1c00
R13: ffff947ed73e5278 R14: dead000000000122 R15: dead000000000100
FS: 00007f15edfea840(0000) GS:ffff9481ad200000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007f1475d98ea8 CR3: 0000000138f68005 CR4: 00000000003706e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Call Trace:
close_ctree+0x2ba/0x2fa [btrfs]
generic_shutdown_super+0x6c/0x100
kill_anon_super+0x14/0x30
btrfs_kill_super+0x12/0x20 [btrfs]
deactivate_locked_super+0x31/0x70
cleanup_mnt+0x100/0x160
task_work_run+0x68/0xb0
exit_to_user_mode_prepare+0x1bb/0x1c0
syscall_exit_to_user_mode+0x4b/0x260
entry_SYSCALL_64_after_hwframe+0x44/0xa9
RIP: 0033:0x7f15ee221ee7
Code: ff 0b 00 f7 d8 64 89 (...)
RSP: 002b:00007ffe9470f0f8 EFLAGS: 00000246 ORIG_RAX: 00000000000000a6
RAX: 0000000000000000 RBX: 00007f15ee347264 RCX: 00007f15ee221ee7
RDX: ffffffffffffff78 RSI: 0000000000000000 RDI: 000056169701d000
RBP: 0000561697018a30 R08: 0000000000000000 R09: 00007f15ee2e2be0
R10: 000056169701efe0 R11: 0000000000000246 R12: 0000000000000000
R13: 000056169701d000 R14: 0000561697018b40 R15: 0000561697018c60
irq event stamp: 0
hardirqs last enabled at (0): [<0000000000000000>] 0x0
hardirqs last disabled at (0): [<ffffffff8bcae560>] copy_process+0x8a0/0x1d70
softirqs last enabled at (0): [<ffffffff8bcae560>] copy_process+0x8a0/0x1d70
softirqs last disabled at (0): [<0000000000000000>] 0x0
---[ end trace dd74718fef1ed5c8 ]---
BTRFS info (device sdc): space_info 4 has 268238848 free, is not full
BTRFS info (device sdc): space_info total=268435456, used=114688, pinned=0, reserved=16384, may_use=0, readonly=65536
BTRFS info (device sdc): global_block_rsv: size 0 reserved 0
BTRFS info (device sdc): trans_block_rsv: size 0 reserved 0
BTRFS info (device sdc): chunk_block_rsv: size 0 reserved 0
BTRFS info (device sdc): delayed_block_rsv: size 0 reserved 0
BTRFS info (device sdc): delayed_refs_rsv: size 524288 reserved 0
And the crash, which only happens when we do not have crc32c hardware
acceleration, produces the following trace immediately after those
warnings:
stack segment: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC PTI
CPU: 2 PID: 1749129 Comm: umount Tainted: G B W 5.10.0-rc4-btrfs-next-73 #1
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.13.0-0-gf21b5a4aeb02-prebuilt.qemu.org 04/01/2014
RIP: 0010:btrfs_queue_work+0x36/0x190 [btrfs]
Code: 54 55 53 48 89 f3 (...)
RSP: 0018:ffffb27082443ae8 EFLAGS: 00010282
RAX: 0000000000000004 RBX: ffff94810ee9ad90 RCX: 0000000000000000
RDX: 0000000000000001 RSI: ffff94810ee9ad90 RDI: ffff947ed8ee75a0
RBP: a56b6b6b6b6b6b6b R08: 0000000000000000 R09: 0000000000000000
R10: 0000000000000007 R11: 0000000000000001 R12: ffff947fa9b435a8
R13: ffff94810ee9ad90 R14: 0000000000000000 R15: ffff947e93dc0000
FS: 00007f3cfe974840(0000) GS:ffff9481ac600000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007f1b42995a70 CR3: 0000000127638003 CR4: 00000000003706e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Call Trace:
btrfs_wq_submit_bio+0xb3/0xd0 [btrfs]
btrfs_submit_metadata_bio+0x44/0xc0 [btrfs]
submit_one_bio+0x61/0x70 [btrfs]
btree_write_cache_pages+0x414/0x450 [btrfs]
? kobject_put+0x9a/0x1d0
? trace_hardirqs_on+0x1b/0xf0
? _raw_spin_unlock_irqrestore+0x3c/0x60
? free_debug_processing+0x1e1/0x2b0
do_writepages+0x43/0xe0
? lock_acquired+0x199/0x490
__writeback_single_inode+0x59/0x650
writeback_single_inode+0xaf/0x120
write_inode_now+0x94/0xd0
iput+0x187/0x2b0
close_ctree+0x2c6/0x2fa [btrfs]
generic_shutdown_super+0x6c/0x100
kill_anon_super+0x14/0x30
btrfs_kill_super+0x12/0x20 [btrfs]
deactivate_locked_super+0x31/0x70
cleanup_mnt+0x100/0x160
task_work_run+0x68/0xb0
exit_to_user_mode_prepare+0x1bb/0x1c0
syscall_exit_to_user_mode+0x4b/0x260
entry_SYSCALL_64_after_hwframe+0x44/0xa9
RIP: 0033:0x7f3cfebabee7
Code: ff 0b 00 f7 d8 64 89 01 (...)
RSP: 002b:00007ffc9c9a05f8 EFLAGS: 00000246 ORIG_RAX: 00000000000000a6
RAX: 0000000000000000 RBX: 00007f3cfecd1264 RCX: 00007f3cfebabee7
RDX: ffffffffffffff78 RSI: 0000000000000000 RDI: 0000562b6b478000
RBP: 0000562b6b473a30 R08: 0000000000000000 R09: 00007f3cfec6cbe0
R10: 0000562b6b479fe0 R11: 0000000000000246 R12: 0000000000000000
R13: 0000562b6b478000 R14: 0000562b6b473b40 R15: 0000562b6b473c60
Modules linked in: btrfs dm_snapshot dm_thin_pool (...)
---[ end trace dd74718fef1ed5cc ]---
Finally when we remove the btrfs module (rmmod btrfs), there are several
warnings about objects that were allocated from our slabs but were never
freed, consequence of the transaction that was never committed and got
leaked:
=============================================================================
BUG btrfs_delayed_ref_head (Tainted: G B W ): Objects remaining in btrfs_delayed_ref_head on __kmem_cache_shutdown()
-----------------------------------------------------------------------------
INFO: Slab 0x0000000094c2ae56 objects=24 used=2 fp=0x000000002bfa2521 flags=0x17fffc000010200
CPU: 5 PID: 1729921 Comm: rmmod Tainted: G B W 5.10.0-rc4-btrfs-next-73 #1
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.13.0-0-gf21b5a4aeb02-prebuilt.qemu.org 04/01/2014
Call Trace:
dump_stack+0x8d/0xb5
slab_err+0xb7/0xdc
? lock_acquired+0x199/0x490
__kmem_cache_shutdown+0x1ac/0x3c0
? lock_release+0x20e/0x4c0
kmem_cache_destroy+0x55/0x120
btrfs_delayed_ref_exit+0x11/0x35 [btrfs]
exit_btrfs_fs+0xa/0x59 [btrfs]
__x64_sys_delete_module+0x194/0x260
? fpregs_assert_state_consistent+0x1e/0x40
? exit_to_user_mode_prepare+0x55/0x1c0
? trace_hardirqs_on+0x1b/0xf0
do_syscall_64+0x33/0x80
entry_SYSCALL_64_after_hwframe+0x44/0xa9
RIP: 0033:0x7f693e305897
Code: 73 01 c3 48 8b 0d f9 f5 (...)
RSP: 002b:00007ffcf73eb508 EFLAGS: 00000206 ORIG_RAX: 00000000000000b0
RAX: ffffffffffffffda RBX: 0000559df504f760 RCX: 00007f693e305897
RDX: 000000000000000a RSI: 0000000000000800 RDI: 0000559df504f7c8
RBP: 00007ffcf73eb568 R08: 0000000000000000 R09: 0000000000000000
R10: 00007f693e378ac0 R11: 0000000000000206 R12: 00007ffcf73eb740
R13: 00007ffcf73ec5a6 R14: 0000559df504f2a0 R15: 0000559df504f760
INFO: Object 0x0000000050cbdd61 @offset=12104
INFO: Allocated in btrfs_add_delayed_tree_ref+0xbb/0x480 [btrfs] age=1894 cpu=6 pid=1729873
__slab_alloc.isra.0+0x109/0x1c0
kmem_cache_alloc+0x7bb/0x830
btrfs_add_delayed_tree_ref+0xbb/0x480 [btrfs]
btrfs_free_tree_block+0x128/0x360 [btrfs]
__btrfs_cow_block+0x489/0x5f0 [btrfs]
btrfs_cow_block+0xf7/0x220 [btrfs]
btrfs_search_slot+0x62a/0xc40 [btrfs]
btrfs_del_orphan_item+0x65/0xd0 [btrfs]
btrfs_find_orphan_roots+0x1bf/0x200 [btrfs]
open_ctree+0x125a/0x18a0 [btrfs]
btrfs_mount_root.cold+0x13/0xed [btrfs]
legacy_get_tree+0x30/0x60
vfs_get_tree+0x28/0xe0
fc_mount+0xe/0x40
vfs_kern_mount.part.0+0x71/0x90
btrfs_mount+0x13b/0x3e0 [btrfs]
INFO: Freed in __btrfs_run_delayed_refs+0x1117/0x1290 [btrfs] age=4292 cpu=2 pid=1729526
kmem_cache_free+0x34c/0x3c0
__btrfs_run_delayed_refs+0x1117/0x1290 [btrfs]
btrfs_run_delayed_refs+0x81/0x210 [btrfs]
commit_cowonly_roots+0xfb/0x300 [btrfs]
btrfs_commit_transaction+0x367/0xc40 [btrfs]
sync_filesystem+0x74/0x90
generic_shutdown_super+0x22/0x100
kill_anon_super+0x14/0x30
btrfs_kill_super+0x12/0x20 [btrfs]
deactivate_locked_super+0x31/0x70
cleanup_mnt+0x100/0x160
task_work_run+0x68/0xb0
exit_to_user_mode_prepare+0x1bb/0x1c0
syscall_exit_to_user_mode+0x4b/0x260
entry_SYSCALL_64_after_hwframe+0x44/0xa9
INFO: Object 0x0000000086e9b0ff @offset=12776
INFO: Allocated in btrfs_add_delayed_tree_ref+0xbb/0x480 [btrfs] age=1900 cpu=6 pid=1729873
__slab_alloc.isra.0+0x109/0x1c0
kmem_cache_alloc+0x7bb/0x830
btrfs_add_delayed_tree_ref+0xbb/0x480 [btrfs]
btrfs_alloc_tree_block+0x2bf/0x360 [btrfs]
alloc_tree_block_no_bg_flush+0x4f/0x60 [btrfs]
__btrfs_cow_block+0x12d/0x5f0 [btrfs]
btrfs_cow_block+0xf7/0x220 [btrfs]
btrfs_search_slot+0x62a/0xc40 [btrfs]
btrfs_del_orphan_item+0x65/0xd0 [btrfs]
btrfs_find_orphan_roots+0x1bf/0x200 [btrfs]
open_ctree+0x125a/0x18a0 [btrfs]
btrfs_mount_root.cold+0x13/0xed [btrfs]
legacy_get_tree+0x30/0x60
vfs_get_tree+0x28/0xe0
fc_mount+0xe/0x40
vfs_kern_mount.part.0+0x71/0x90
INFO: Freed in __btrfs_run_delayed_refs+0x1117/0x1290 [btrfs] age=3141 cpu=6 pid=1729803
kmem_cache_free+0x34c/0x3c0
__btrfs_run_delayed_refs+0x1117/0x1290 [btrfs]
btrfs_run_delayed_refs+0x81/0x210 [btrfs]
btrfs_write_dirty_block_groups+0x17d/0x3d0 [btrfs]
commit_cowonly_roots+0x248/0x300 [btrfs]
btrfs_commit_transaction+0x367/0xc40 [btrfs]
close_ctree+0x113/0x2fa [btrfs]
generic_shutdown_super+0x6c/0x100
kill_anon_super+0x14/0x30
btrfs_kill_super+0x12/0x20 [btrfs]
deactivate_locked_super+0x31/0x70
cleanup_mnt+0x100/0x160
task_work_run+0x68/0xb0
exit_to_user_mode_prepare+0x1bb/0x1c0
syscall_exit_to_user_mode+0x4b/0x260
entry_SYSCALL_64_after_hwframe+0x44/0xa9
kmem_cache_destroy btrfs_delayed_ref_head: Slab cache still has objects
CPU: 5 PID: 1729921 Comm: rmmod Tainted: G B W 5.10.0-rc4-btrfs-next-73 #1
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.13.0-0-gf21b5a4aeb02-prebuilt.qemu.org 04/01/2014
Call Trace:
dump_stack+0x8d/0xb5
kmem_cache_destroy+0x119/0x120
btrfs_delayed_ref_exit+0x11/0x35 [btrfs]
exit_btrfs_fs+0xa/0x59 [btrfs]
__x64_sys_delete_module+0x194/0x260
? fpregs_assert_state_consistent+0x1e/0x40
? exit_to_user_mode_prepare+0x55/0x1c0
? trace_hardirqs_on+0x1b/0xf0
do_syscall_64+0x33/0x80
entry_SYSCALL_64_after_hwframe+0x44/0xa9
RIP: 0033:0x7f693e305897
Code: 73 01 c3 48 8b 0d f9 f5 0b (...)
RSP: 002b:00007ffcf73eb508 EFLAGS: 00000206 ORIG_RAX: 00000000000000b0
RAX: ffffffffffffffda RBX: 0000559df504f760 RCX: 00007f693e305897
RDX: 000000000000000a RSI: 0000000000000800 RDI: 0000559df504f7c8
RBP: 00007ffcf73eb568 R08: 0000000000000000 R09: 0000000000000000
R10: 00007f693e378ac0 R11: 0000000000000206 R12: 00007ffcf73eb740
R13: 00007ffcf73ec5a6 R14: 0000559df504f2a0 R15: 0000559df504f760
=============================================================================
BUG btrfs_delayed_tree_ref (Tainted: G B W ): Objects remaining in btrfs_delayed_tree_ref on __kmem_cache_shutdown()
-----------------------------------------------------------------------------
INFO: Slab 0x0000000011f78dc0 objects=37 used=2 fp=0x0000000032d55d91 flags=0x17fffc000010200
CPU: 3 PID: 1729921 Comm: rmmod Tainted: G B W 5.10.0-rc4-btrfs-next-73 #1
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.13.0-0-gf21b5a4aeb02-prebuilt.qemu.org 04/01/2014
Call Trace:
dump_stack+0x8d/0xb5
slab_err+0xb7/0xdc
? lock_acquired+0x199/0x490
__kmem_cache_shutdown+0x1ac/0x3c0
? lock_release+0x20e/0x4c0
kmem_cache_destroy+0x55/0x120
btrfs_delayed_ref_exit+0x1d/0x35 [btrfs]
exit_btrfs_fs+0xa/0x59 [btrfs]
__x64_sys_delete_module+0x194/0x260
? fpregs_assert_state_consistent+0x1e/0x40
? exit_to_user_mode_prepare+0x55/0x1c0
? trace_hardirqs_on+0x1b/0xf0
do_syscall_64+0x33/0x80
entry_SYSCALL_64_after_hwframe+0x44/0xa9
RIP: 0033:0x7f693e305897
Code: 73 01 c3 48 8b 0d f9 f5 (...)
RSP: 002b:00007ffcf73eb508 EFLAGS: 00000206 ORIG_RAX: 00000000000000b0
RAX: ffffffffffffffda RBX: 0000559df504f760 RCX: 00007f693e305897
RDX: 000000000000000a RSI: 0000000000000800 RDI: 0000559df504f7c8
RBP: 00007ffcf73eb568 R08: 0000000000000000 R09: 0000000000000000
R10: 00007f693e378ac0 R11: 0000000000000206 R12: 00007ffcf73eb740
R13: 00007ffcf73ec5a6 R14: 0000559df504f2a0 R15: 0000559df504f760
INFO: Object 0x000000001a340018 @offset=4408
INFO: Allocated in btrfs_add_delayed_tree_ref+0x9e/0x480 [btrfs] age=1917 cpu=6 pid=1729873
__slab_alloc.isra.0+0x109/0x1c0
kmem_cache_alloc+0x7bb/0x830
btrfs_add_delayed_tree_ref+0x9e/0x480 [btrfs]
btrfs_free_tree_block+0x128/0x360 [btrfs]
__btrfs_cow_block+0x489/0x5f0 [btrfs]
btrfs_cow_block+0xf7/0x220 [btrfs]
btrfs_search_slot+0x62a/0xc40 [btrfs]
btrfs_del_orphan_item+0x65/0xd0 [btrfs]
btrfs_find_orphan_roots+0x1bf/0x200 [btrfs]
open_ctree+0x125a/0x18a0 [btrfs]
btrfs_mount_root.cold+0x13/0xed [btrfs]
legacy_get_tree+0x30/0x60
vfs_get_tree+0x28/0xe0
fc_mount+0xe/0x40
vfs_kern_mount.part.0+0x71/0x90
btrfs_mount+0x13b/0x3e0 [btrfs]
INFO: Freed in __btrfs_run_delayed_refs+0x63d/0x1290 [btrfs] age=4167 cpu=4 pid=1729795
kmem_cache_free+0x34c/0x3c0
__btrfs_run_delayed_refs+0x63d/0x1290 [btrfs]
btrfs_run_delayed_refs+0x81/0x210 [btrfs]
btrfs_commit_transaction+0x60/0xc40 [btrfs]
create_subvol+0x56a/0x990 [btrfs]
btrfs_mksubvol+0x3fb/0x4a0 [btrfs]
__btrfs_ioctl_snap_create+0x119/0x1a0 [btrfs]
btrfs_ioctl_snap_create+0x58/0x80 [btrfs]
btrfs_ioctl+0x1a92/0x36f0 [btrfs]
__x64_sys_ioctl+0x83/0xb0
do_syscall_64+0x33/0x80
entry_SYSCALL_64_after_hwframe+0x44/0xa9
INFO: Object 0x000000002b46292a @offset=13648
INFO: Allocated in btrfs_add_delayed_tree_ref+0x9e/0x480 [btrfs] age=1923 cpu=6 pid=1729873
__slab_alloc.isra.0+0x109/0x1c0
kmem_cache_alloc+0x7bb/0x830
btrfs_add_delayed_tree_ref+0x9e/0x480 [btrfs]
btrfs_alloc_tree_block+0x2bf/0x360 [btrfs]
alloc_tree_block_no_bg_flush+0x4f/0x60 [btrfs]
__btrfs_cow_block+0x12d/0x5f0 [btrfs]
btrfs_cow_block+0xf7/0x220 [btrfs]
btrfs_search_slot+0x62a/0xc40 [btrfs]
btrfs_del_orphan_item+0x65/0xd0 [btrfs]
btrfs_find_orphan_roots+0x1bf/0x200 [btrfs]
open_ctree+0x125a/0x18a0 [btrfs]
btrfs_mount_root.cold+0x13/0xed [btrfs]
legacy_get_tree+0x30/0x60
vfs_get_tree+0x28/0xe0
fc_mount+0xe/0x40
vfs_kern_mount.part.0+0x71/0x90
INFO: Freed in __btrfs_run_delayed_refs+0x63d/0x1290 [btrfs] age=3164 cpu=6 pid=1729803
kmem_cache_free+0x34c/0x3c0
__btrfs_run_delayed_refs+0x63d/0x1290 [btrfs]
btrfs_run_delayed_refs+0x81/0x210 [btrfs]
commit_cowonly_roots+0xfb/0x300 [btrfs]
btrfs_commit_transaction+0x367/0xc40 [btrfs]
close_ctree+0x113/0x2fa [btrfs]
generic_shutdown_super+0x6c/0x100
kill_anon_super+0x14/0x30
btrfs_kill_super+0x12/0x20 [btrfs]
deactivate_locked_super+0x31/0x70
cleanup_mnt+0x100/0x160
task_work_run+0x68/0xb0
exit_to_user_mode_prepare+0x1bb/0x1c0
syscall_exit_to_user_mode+0x4b/0x260
entry_SYSCALL_64_after_hwframe+0x44/0xa9
kmem_cache_destroy btrfs_delayed_tree_ref: Slab cache still has objects
CPU: 5 PID: 1729921 Comm: rmmod Tainted: G B W 5.10.0-rc4-btrfs-next-73 #1
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.13.0-0-gf21b5a4aeb02-prebuilt.qemu.org 04/01/2014
Call Trace:
dump_stack+0x8d/0xb5
kmem_cache_destroy+0x119/0x120
btrfs_delayed_ref_exit+0x1d/0x35 [btrfs]
exit_btrfs_fs+0xa/0x59 [btrfs]
__x64_sys_delete_module+0x194/0x260
? fpregs_assert_state_consistent+0x1e/0x40
? exit_to_user_mode_prepare+0x55/0x1c0
? trace_hardirqs_on+0x1b/0xf0
do_syscall_64+0x33/0x80
entry_SYSCALL_64_after_hwframe+0x44/0xa9
RIP: 0033:0x7f693e305897
Code: 73 01 c3 48 8b 0d f9 f5 (...)
RSP: 002b:00007ffcf73eb508 EFLAGS: 00000206 ORIG_RAX: 00000000000000b0
RAX: ffffffffffffffda RBX: 0000559df504f760 RCX: 00007f693e305897
RDX: 000000000000000a RSI: 0000000000000800 RDI: 0000559df504f7c8
RBP: 00007ffcf73eb568 R08: 0000000000000000 R09: 0000000000000000
R10: 00007f693e378ac0 R11: 0000000000000206 R12: 00007ffcf73eb740
R13: 00007ffcf73ec5a6 R14: 0000559df504f2a0 R15: 0000559df504f760
=============================================================================
BUG btrfs_delayed_extent_op (Tainted: G B W ): Objects remaining in btrfs_delayed_extent_op on __kmem_cache_shutdown()
-----------------------------------------------------------------------------
INFO: Slab 0x00000000f145ce2f objects=22 used=1 fp=0x00000000af0f92cf flags=0x17fffc000010200
CPU: 5 PID: 1729921 Comm: rmmod Tainted: G B W 5.10.0-rc4-btrfs-next-73 #1
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.13.0-0-gf21b5a4aeb02-prebuilt.qemu.org 04/01/2014
Call Trace:
dump_stack+0x8d/0xb5
slab_err+0xb7/0xdc
? lock_acquired+0x199/0x490
__kmem_cache_shutdown+0x1ac/0x3c0
? __mutex_unlock_slowpath+0x45/0x2a0
kmem_cache_destroy+0x55/0x120
exit_btrfs_fs+0xa/0x59 [btrfs]
__x64_sys_delete_module+0x194/0x260
? fpregs_assert_state_consistent+0x1e/0x40
? exit_to_user_mode_prepare+0x55/0x1c0
? trace_hardirqs_on+0x1b/0xf0
do_syscall_64+0x33/0x80
entry_SYSCALL_64_after_hwframe+0x44/0xa9
RIP: 0033:0x7f693e305897
Code: 73 01 c3 48 8b 0d f9 f5 (...)
RSP: 002b:00007ffcf73eb508 EFLAGS: 00000206 ORIG_RAX: 00000000000000b0
RAX: ffffffffffffffda RBX: 0000559df504f760 RCX: 00007f693e305897
RDX: 000000000000000a RSI: 0000000000000800 RDI: 0000559df504f7c8
RBP: 00007ffcf73eb568 R08: 0000000000000000 R09: 0000000000000000
R10: 00007f693e378ac0 R11: 0000000000000206 R12: 00007ffcf73eb740
R13: 00007ffcf73ec5a6 R14: 0000559df504f2a0 R15: 0000559df504f760
INFO: Object 0x000000004cf95ea8 @offset=6264
INFO: Allocated in btrfs_alloc_tree_block+0x1e0/0x360 [btrfs] age=1931 cpu=6 pid=1729873
__slab_alloc.isra.0+0x109/0x1c0
kmem_cache_alloc+0x7bb/0x830
btrfs_alloc_tree_block+0x1e0/0x360 [btrfs]
alloc_tree_block_no_bg_flush+0x4f/0x60 [btrfs]
__btrfs_cow_block+0x12d/0x5f0 [btrfs]
btrfs_cow_block+0xf7/0x220 [btrfs]
btrfs_search_slot+0x62a/0xc40 [btrfs]
btrfs_del_orphan_item+0x65/0xd0 [btrfs]
btrfs_find_orphan_roots+0x1bf/0x200 [btrfs]
open_ctree+0x125a/0x18a0 [btrfs]
btrfs_mount_root.cold+0x13/0xed [btrfs]
legacy_get_tree+0x30/0x60
vfs_get_tree+0x28/0xe0
fc_mount+0xe/0x40
vfs_kern_mount.part.0+0x71/0x90
btrfs_mount+0x13b/0x3e0 [btrfs]
INFO: Freed in __btrfs_run_delayed_refs+0xabd/0x1290 [btrfs] age=3173 cpu=6 pid=1729803
kmem_cache_free+0x34c/0x3c0
__btrfs_run_delayed_refs+0xabd/0x1290 [btrfs]
btrfs_run_delayed_refs+0x81/0x210 [btrfs]
commit_cowonly_roots+0xfb/0x300 [btrfs]
btrfs_commit_transaction+0x367/0xc40 [btrfs]
close_ctree+0x113/0x2fa [btrfs]
generic_shutdown_super+0x6c/0x100
kill_anon_super+0x14/0x30
btrfs_kill_super+0x12/0x20 [btrfs]
deactivate_locked_super+0x31/0x70
cleanup_mnt+0x100/0x160
task_work_run+0x68/0xb0
exit_to_user_mode_prepare+0x1bb/0x1c0
syscall_exit_to_user_mode+0x4b/0x260
entry_SYSCALL_64_after_hwframe+0x44/0xa9
kmem_cache_destroy btrfs_delayed_extent_op: Slab cache still has objects
CPU: 3 PID: 1729921 Comm: rmmod Tainted: G B W 5.10.0-rc4-btrfs-next-73 #1
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.13.0-0-gf21b5a4aeb02-prebuilt.qemu.org 04/01/2014
Call Trace:
dump_stack+0x8d/0xb5
kmem_cache_destroy+0x119/0x120
exit_btrfs_fs+0xa/0x59 [btrfs]
__x64_sys_delete_module+0x194/0x260
? fpregs_assert_state_consistent+0x1e/0x40
? exit_to_user_mode_prepare+0x55/0x1c0
? trace_hardirqs_on+0x1b/0xf0
do_syscall_64+0x33/0x80
entry_SYSCALL_64_after_hwframe+0x44/0xa9
RIP: 0033:0x7f693e305897
Code: 73 01 c3 48 8b 0d f9 (...)
RSP: 002b:00007ffcf73eb508 EFLAGS: 00000206 ORIG_RAX: 00000000000000b0
RAX: ffffffffffffffda RBX: 0000559df504f760 RCX: 00007f693e305897
RDX: 000000000000000a RSI: 0000000000000800 RDI: 0000559df504f7c8
RBP: 00007ffcf73eb568 R08: 0000000000000000 R09: 0000000000000000
R10: 00007f693e378ac0 R11: 0000000000000206 R12: 00007ffcf73eb740
R13: 00007ffcf73ec5a6 R14: 0000559df504f2a0 R15: 0000559df504f760
BTRFS: state leak: start 30408704 end 30425087 state 1 in tree 1 refs 1
So fix this by calling btrfs_find_orphan_roots() in the mount path only if
we are mounting the filesystem in RW mode. It's pointless to have it called
for RO mounts anyway, since despite adding any deleted roots to the list of
dead roots, we will never have the roots deleted until the filesystem is
remounted in RW mode, as the cleaner kthread does nothing when we are
mounted in RO - btrfs_need_cleaner_sleep() always returns true and the
cleaner spends all time sleeping, never cleaning dead roots.
This is accomplished by moving the call to btrfs_find_orphan_roots() from
open_ctree() to btrfs_start_pre_rw_mount(), which also guarantees that
if later the filesystem is remounted RW, we populate the list of dead
roots and have the cleaner task delete the dead roots.
Tested-by: Fabian Vogt <fvogt@suse.com>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
|
|
cb13eea3b4 |
btrfs: fix transaction leak and crash after RO remount caused by qgroup rescan
If we remount a filesystem in RO mode while the qgroup rescan worker is running, we can end up having it still running after the remount is done, and at unmount time we may end up with an open transaction that ends up never getting committed. If that happens we end up with several memory leaks and can crash when hardware acceleration is unavailable for crc32c. Possibly it can lead to other nasty surprises too, due to use-after-free issues. The following steps explain how the problem happens. 1) We have a filesystem mounted in RW mode and the qgroup rescan worker is running; 2) We remount the filesystem in RO mode, and never stop/pause the rescan worker, so after the remount the rescan worker is still running. The important detail here is that the rescan task is still running after the remount operation committed any ongoing transaction through its call to btrfs_commit_super(); 3) The rescan is still running, and after the remount completed, the rescan worker started a transaction, after it finished iterating all leaves of the extent tree, to update the qgroup status item in the quotas tree. It does not commit the transaction, it only releases its handle on the transaction; 4) A filesystem unmount operation starts shortly after; 5) The unmount task, at close_ctree(), stops the transaction kthread, which had not had a chance to commit the open transaction since it was sleeping and the commit interval (default of 30 seconds) has not yet elapsed since the last time it committed a transaction; 6) So after stopping the transaction kthread we still have the transaction used to update the qgroup status item open. At close_ctree(), when the filesystem is in RO mode and no transaction abort happened (or the filesystem is in error mode), we do not expect to have any transaction open, so we do not call btrfs_commit_super(); 7) We then proceed to destroy the work queues, free the roots and block groups, etc. After that we drop the last reference on the btree inode by calling iput() on it. Since there are dirty pages for the btree inode, corresponding to the COWed extent buffer for the quotas btree, btree_write_cache_pages() is invoked to flush those dirty pages. This results in creating a bio and submitting it, which makes us end up at btrfs_submit_metadata_bio(); 8) At btrfs_submit_metadata_bio() we end up at the if-then-else branch that calls btrfs_wq_submit_bio(), because check_async_write() returned a value of 1. This value of 1 is because we did not have hardware acceleration available for crc32c, so BTRFS_FS_CSUM_IMPL_FAST was not set in fs_info->flags; 9) Then at btrfs_wq_submit_bio() we call btrfs_queue_work() against the workqueue at fs_info->workers, which was already freed before by the call to btrfs_stop_all_workers() at close_ctree(). This results in an invalid memory access due to a use-after-free, leading to a crash. When this happens, before the crash there are several warnings triggered, since we have reserved metadata space in a block group, the delayed refs reservation, etc: ------------[ cut here ]------------ WARNING: CPU: 4 PID: 1729896 at fs/btrfs/block-group.c:125 btrfs_put_block_group+0x63/0xa0 [btrfs] Modules linked in: btrfs dm_snapshot dm_thin_pool (...) CPU: 4 PID: 1729896 Comm: umount Tainted: G B W 5.10.0-rc4-btrfs-next-73 #1 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.13.0-0-gf21b5a4aeb02-prebuilt.qemu.org 04/01/2014 RIP: 0010:btrfs_put_block_group+0x63/0xa0 [btrfs] Code: f0 01 00 00 48 39 c2 75 (...) RSP: 0018:ffffb270826bbdd8 EFLAGS: 00010206 RAX: 0000000000000001 RBX: ffff947ed73e4000 RCX: ffff947ebc8b29c8 RDX: 0000000000000001 RSI: ffffffffc0b150a0 RDI: ffff947ebc8b2800 RBP: ffff947ebc8b2800 R08: 0000000000000000 R09: 0000000000000000 R10: 0000000000000000 R11: 0000000000000001 R12: ffff947ed73e4110 R13: ffff947ed73e4160 R14: ffff947ebc8b2988 R15: dead000000000100 FS: 00007f15edfea840(0000) GS:ffff9481ad600000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 00007f37e2893320 CR3: 0000000138f68001 CR4: 00000000003706e0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 Call Trace: btrfs_free_block_groups+0x17f/0x2f0 [btrfs] close_ctree+0x2ba/0x2fa [btrfs] generic_shutdown_super+0x6c/0x100 kill_anon_super+0x14/0x30 btrfs_kill_super+0x12/0x20 [btrfs] deactivate_locked_super+0x31/0x70 cleanup_mnt+0x100/0x160 task_work_run+0x68/0xb0 exit_to_user_mode_prepare+0x1bb/0x1c0 syscall_exit_to_user_mode+0x4b/0x260 entry_SYSCALL_64_after_hwframe+0x44/0xa9 RIP: 0033:0x7f15ee221ee7 Code: ff 0b 00 f7 d8 64 89 01 48 (...) RSP: 002b:00007ffe9470f0f8 EFLAGS: 00000246 ORIG_RAX: 00000000000000a6 RAX: 0000000000000000 RBX: 00007f15ee347264 RCX: 00007f15ee221ee7 RDX: ffffffffffffff78 RSI: 0000000000000000 RDI: 000056169701d000 RBP: 0000561697018a30 R08: 0000000000000000 R09: 00007f15ee2e2be0 R10: 000056169701efe0 R11: 0000000000000246 R12: 0000000000000000 R13: 000056169701d000 R14: 0000561697018b40 R15: 0000561697018c60 irq event stamp: 0 hardirqs last enabled at (0): [<0000000000000000>] 0x0 hardirqs last disabled at (0): [<ffffffff8bcae560>] copy_process+0x8a0/0x1d70 softirqs last enabled at (0): [<ffffffff8bcae560>] copy_process+0x8a0/0x1d70 softirqs last disabled at (0): [<0000000000000000>] 0x0 ---[ end trace dd74718fef1ed5c6 ]--- ------------[ cut here ]------------ WARNING: CPU: 2 PID: 1729896 at fs/btrfs/block-rsv.c:459 btrfs_release_global_block_rsv+0x70/0xc0 [btrfs] Modules linked in: btrfs dm_snapshot dm_thin_pool (...) CPU: 2 PID: 1729896 Comm: umount Tainted: G B W 5.10.0-rc4-btrfs-next-73 #1 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.13.0-0-gf21b5a4aeb02-prebuilt.qemu.org 04/01/2014 RIP: 0010:btrfs_release_global_block_rsv+0x70/0xc0 [btrfs] Code: 48 83 bb b0 03 00 00 00 (...) RSP: 0018:ffffb270826bbdd8 EFLAGS: 00010206 RAX: 000000000033c000 RBX: ffff947ed73e4000 RCX: 0000000000000000 RDX: 0000000000000001 RSI: ffffffffc0b0d8c1 RDI: 00000000ffffffff RBP: ffff947ebc8b7000 R08: 0000000000000001 R09: 0000000000000000 R10: 0000000000000000 R11: 0000000000000001 R12: ffff947ed73e4110 R13: ffff947ed73e5278 R14: dead000000000122 R15: dead000000000100 FS: 00007f15edfea840(0000) GS:ffff9481aca00000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 0000561a79f76e20 CR3: 0000000138f68006 CR4: 00000000003706e0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 Call Trace: btrfs_free_block_groups+0x24c/0x2f0 [btrfs] close_ctree+0x2ba/0x2fa [btrfs] generic_shutdown_super+0x6c/0x100 kill_anon_super+0x14/0x30 btrfs_kill_super+0x12/0x20 [btrfs] deactivate_locked_super+0x31/0x70 cleanup_mnt+0x100/0x160 task_work_run+0x68/0xb0 exit_to_user_mode_prepare+0x1bb/0x1c0 syscall_exit_to_user_mode+0x4b/0x260 entry_SYSCALL_64_after_hwframe+0x44/0xa9 RIP: 0033:0x7f15ee221ee7 Code: ff 0b 00 f7 d8 64 89 01 (...) RSP: 002b:00007ffe9470f0f8 EFLAGS: 00000246 ORIG_RAX: 00000000000000a6 RAX: 0000000000000000 RBX: 00007f15ee347264 RCX: 00007f15ee221ee7 RDX: ffffffffffffff78 RSI: 0000000000000000 RDI: 000056169701d000 RBP: 0000561697018a30 R08: 0000000000000000 R09: 00007f15ee2e2be0 R10: 000056169701efe0 R11: 0000000000000246 R12: 0000000000000000 R13: 000056169701d000 R14: 0000561697018b40 R15: 0000561697018c60 irq event stamp: 0 hardirqs last enabled at (0): [<0000000000000000>] 0x0 hardirqs last disabled at (0): [<ffffffff8bcae560>] copy_process+0x8a0/0x1d70 softirqs last enabled at (0): [<ffffffff8bcae560>] copy_process+0x8a0/0x1d70 softirqs last disabled at (0): [<0000000000000000>] 0x0 ---[ end trace dd74718fef1ed5c7 ]--- ------------[ cut here ]------------ WARNING: CPU: 2 PID: 1729896 at fs/btrfs/block-group.c:3377 btrfs_free_block_groups+0x25d/0x2f0 [btrfs] Modules linked in: btrfs dm_snapshot dm_thin_pool (...) CPU: 5 PID: 1729896 Comm: umount Tainted: G B W 5.10.0-rc4-btrfs-next-73 #1 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.13.0-0-gf21b5a4aeb02-prebuilt.qemu.org 04/01/2014 RIP: 0010:btrfs_free_block_groups+0x25d/0x2f0 [btrfs] Code: ad de 49 be 22 01 00 (...) RSP: 0018:ffffb270826bbde8 EFLAGS: 00010206 RAX: ffff947ebeae1d08 RBX: ffff947ed73e4000 RCX: 0000000000000000 RDX: 0000000000000001 RSI: ffff947e9d823ae8 RDI: 0000000000000246 RBP: ffff947ebeae1d08 R08: 0000000000000000 R09: 0000000000000000 R10: 0000000000000000 R11: 0000000000000001 R12: ffff947ebeae1c00 R13: ffff947ed73e5278 R14: dead000000000122 R15: dead000000000100 FS: 00007f15edfea840(0000) GS:ffff9481ad200000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 00007f1475d98ea8 CR3: 0000000138f68005 CR4: 00000000003706e0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 Call Trace: close_ctree+0x2ba/0x2fa [btrfs] generic_shutdown_super+0x6c/0x100 kill_anon_super+0x14/0x30 btrfs_kill_super+0x12/0x20 [btrfs] deactivate_locked_super+0x31/0x70 cleanup_mnt+0x100/0x160 task_work_run+0x68/0xb0 exit_to_user_mode_prepare+0x1bb/0x1c0 syscall_exit_to_user_mode+0x4b/0x260 entry_SYSCALL_64_after_hwframe+0x44/0xa9 RIP: 0033:0x7f15ee221ee7 Code: ff 0b 00 f7 d8 64 89 (...) RSP: 002b:00007ffe9470f0f8 EFLAGS: 00000246 ORIG_RAX: 00000000000000a6 RAX: 0000000000000000 RBX: 00007f15ee347264 RCX: 00007f15ee221ee7 RDX: ffffffffffffff78 RSI: 0000000000000000 RDI: 000056169701d000 RBP: 0000561697018a30 R08: 0000000000000000 R09: 00007f15ee2e2be0 R10: 000056169701efe0 R11: 0000000000000246 R12: 0000000000000000 R13: 000056169701d000 R14: 0000561697018b40 R15: 0000561697018c60 irq event stamp: 0 hardirqs last enabled at (0): [<0000000000000000>] 0x0 hardirqs last disabled at (0): [<ffffffff8bcae560>] copy_process+0x8a0/0x1d70 softirqs last enabled at (0): [<ffffffff8bcae560>] copy_process+0x8a0/0x1d70 softirqs last disabled at (0): [<0000000000000000>] 0x0 ---[ end trace dd74718fef1ed5c8 ]--- BTRFS info (device sdc): space_info 4 has 268238848 free, is not full BTRFS info (device sdc): space_info total=268435456, used=114688, pinned=0, reserved=16384, may_use=0, readonly=65536 BTRFS info (device sdc): global_block_rsv: size 0 reserved 0 BTRFS info (device sdc): trans_block_rsv: size 0 reserved 0 BTRFS info (device sdc): chunk_block_rsv: size 0 reserved 0 BTRFS info (device sdc): delayed_block_rsv: size 0 reserved 0 BTRFS info (device sdc): delayed_refs_rsv: size 524288 reserved 0 And the crash, which only happens when we do not have crc32c hardware acceleration, produces the following trace immediately after those warnings: stack segment: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC PTI CPU: 2 PID: 1749129 Comm: umount Tainted: G B W 5.10.0-rc4-btrfs-next-73 #1 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.13.0-0-gf21b5a4aeb02-prebuilt.qemu.org 04/01/2014 RIP: 0010:btrfs_queue_work+0x36/0x190 [btrfs] Code: 54 55 53 48 89 f3 (...) RSP: 0018:ffffb27082443ae8 EFLAGS: 00010282 RAX: 0000000000000004 RBX: ffff94810ee9ad90 RCX: 0000000000000000 RDX: 0000000000000001 RSI: ffff94810ee9ad90 RDI: ffff947ed8ee75a0 RBP: a56b6b6b6b6b6b6b R08: 0000000000000000 R09: 0000000000000000 R10: 0000000000000007 R11: 0000000000000001 R12: ffff947fa9b435a8 R13: ffff94810ee9ad90 R14: 0000000000000000 R15: ffff947e93dc0000 FS: 00007f3cfe974840(0000) GS:ffff9481ac600000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 00007f1b42995a70 CR3: 0000000127638003 CR4: 00000000003706e0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 Call Trace: btrfs_wq_submit_bio+0xb3/0xd0 [btrfs] btrfs_submit_metadata_bio+0x44/0xc0 [btrfs] submit_one_bio+0x61/0x70 [btrfs] btree_write_cache_pages+0x414/0x450 [btrfs] ? kobject_put+0x9a/0x1d0 ? trace_hardirqs_on+0x1b/0xf0 ? _raw_spin_unlock_irqrestore+0x3c/0x60 ? free_debug_processing+0x1e1/0x2b0 do_writepages+0x43/0xe0 ? lock_acquired+0x199/0x490 __writeback_single_inode+0x59/0x650 writeback_single_inode+0xaf/0x120 write_inode_now+0x94/0xd0 iput+0x187/0x2b0 close_ctree+0x2c6/0x2fa [btrfs] generic_shutdown_super+0x6c/0x100 kill_anon_super+0x14/0x30 btrfs_kill_super+0x12/0x20 [btrfs] deactivate_locked_super+0x31/0x70 cleanup_mnt+0x100/0x160 task_work_run+0x68/0xb0 exit_to_user_mode_prepare+0x1bb/0x1c0 syscall_exit_to_user_mode+0x4b/0x260 entry_SYSCALL_64_after_hwframe+0x44/0xa9 RIP: 0033:0x7f3cfebabee7 Code: ff 0b 00 f7 d8 64 89 01 (...) RSP: 002b:00007ffc9c9a05f8 EFLAGS: 00000246 ORIG_RAX: 00000000000000a6 RAX: 0000000000000000 RBX: 00007f3cfecd1264 RCX: 00007f3cfebabee7 RDX: ffffffffffffff78 RSI: 0000000000000000 RDI: 0000562b6b478000 RBP: 0000562b6b473a30 R08: 0000000000000000 R09: 00007f3cfec6cbe0 R10: 0000562b6b479fe0 R11: 0000000000000246 R12: 0000000000000000 R13: 0000562b6b478000 R14: 0000562b6b473b40 R15: 0000562b6b473c60 Modules linked in: btrfs dm_snapshot dm_thin_pool (...) ---[ end trace dd74718fef1ed5cc ]--- Finally when we remove the btrfs module (rmmod btrfs), there are several warnings about objects that were allocated from our slabs but were never freed, consequence of the transaction that was never committed and got leaked: ============================================================================= BUG btrfs_delayed_ref_head (Tainted: G B W ): Objects remaining in btrfs_delayed_ref_head on __kmem_cache_shutdown() ----------------------------------------------------------------------------- INFO: Slab 0x0000000094c2ae56 objects=24 used=2 fp=0x000000002bfa2521 flags=0x17fffc000010200 CPU: 5 PID: 1729921 Comm: rmmod Tainted: G B W 5.10.0-rc4-btrfs-next-73 #1 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.13.0-0-gf21b5a4aeb02-prebuilt.qemu.org 04/01/2014 Call Trace: dump_stack+0x8d/0xb5 slab_err+0xb7/0xdc ? lock_acquired+0x199/0x490 __kmem_cache_shutdown+0x1ac/0x3c0 ? lock_release+0x20e/0x4c0 kmem_cache_destroy+0x55/0x120 btrfs_delayed_ref_exit+0x11/0x35 [btrfs] exit_btrfs_fs+0xa/0x59 [btrfs] __x64_sys_delete_module+0x194/0x260 ? fpregs_assert_state_consistent+0x1e/0x40 ? exit_to_user_mode_prepare+0x55/0x1c0 ? trace_hardirqs_on+0x1b/0xf0 do_syscall_64+0x33/0x80 entry_SYSCALL_64_after_hwframe+0x44/0xa9 RIP: 0033:0x7f693e305897 Code: 73 01 c3 48 8b 0d f9 f5 (...) RSP: 002b:00007ffcf73eb508 EFLAGS: 00000206 ORIG_RAX: 00000000000000b0 RAX: ffffffffffffffda RBX: 0000559df504f760 RCX: 00007f693e305897 RDX: 000000000000000a RSI: 0000000000000800 RDI: 0000559df504f7c8 RBP: 00007ffcf73eb568 R08: 0000000000000000 R09: 0000000000000000 R10: 00007f693e378ac0 R11: 0000000000000206 R12: 00007ffcf73eb740 R13: 00007ffcf73ec5a6 R14: 0000559df504f2a0 R15: 0000559df504f760 INFO: Object 0x0000000050cbdd61 @offset=12104 INFO: Allocated in btrfs_add_delayed_tree_ref+0xbb/0x480 [btrfs] age=1894 cpu=6 pid=1729873 __slab_alloc.isra.0+0x109/0x1c0 kmem_cache_alloc+0x7bb/0x830 btrfs_add_delayed_tree_ref+0xbb/0x480 [btrfs] btrfs_free_tree_block+0x128/0x360 [btrfs] __btrfs_cow_block+0x489/0x5f0 [btrfs] btrfs_cow_block+0xf7/0x220 [btrfs] btrfs_search_slot+0x62a/0xc40 [btrfs] btrfs_del_orphan_item+0x65/0xd0 [btrfs] btrfs_find_orphan_roots+0x1bf/0x200 [btrfs] open_ctree+0x125a/0x18a0 [btrfs] btrfs_mount_root.cold+0x13/0xed [btrfs] legacy_get_tree+0x30/0x60 vfs_get_tree+0x28/0xe0 fc_mount+0xe/0x40 vfs_kern_mount.part.0+0x71/0x90 btrfs_mount+0x13b/0x3e0 [btrfs] INFO: Freed in __btrfs_run_delayed_refs+0x1117/0x1290 [btrfs] age=4292 cpu=2 pid=1729526 kmem_cache_free+0x34c/0x3c0 __btrfs_run_delayed_refs+0x1117/0x1290 [btrfs] btrfs_run_delayed_refs+0x81/0x210 [btrfs] commit_cowonly_roots+0xfb/0x300 [btrfs] btrfs_commit_transaction+0x367/0xc40 [btrfs] sync_filesystem+0x74/0x90 generic_shutdown_super+0x22/0x100 kill_anon_super+0x14/0x30 btrfs_kill_super+0x12/0x20 [btrfs] deactivate_locked_super+0x31/0x70 cleanup_mnt+0x100/0x160 task_work_run+0x68/0xb0 exit_to_user_mode_prepare+0x1bb/0x1c0 syscall_exit_to_user_mode+0x4b/0x260 entry_SYSCALL_64_after_hwframe+0x44/0xa9 INFO: Object 0x0000000086e9b0ff @offset=12776 INFO: Allocated in btrfs_add_delayed_tree_ref+0xbb/0x480 [btrfs] age=1900 cpu=6 pid=1729873 __slab_alloc.isra.0+0x109/0x1c0 kmem_cache_alloc+0x7bb/0x830 btrfs_add_delayed_tree_ref+0xbb/0x480 [btrfs] btrfs_alloc_tree_block+0x2bf/0x360 [btrfs] alloc_tree_block_no_bg_flush+0x4f/0x60 [btrfs] __btrfs_cow_block+0x12d/0x5f0 [btrfs] btrfs_cow_block+0xf7/0x220 [btrfs] btrfs_search_slot+0x62a/0xc40 [btrfs] btrfs_del_orphan_item+0x65/0xd0 [btrfs] btrfs_find_orphan_roots+0x1bf/0x200 [btrfs] open_ctree+0x125a/0x18a0 [btrfs] btrfs_mount_root.cold+0x13/0xed [btrfs] legacy_get_tree+0x30/0x60 vfs_get_tree+0x28/0xe0 fc_mount+0xe/0x40 vfs_kern_mount.part.0+0x71/0x90 INFO: Freed in __btrfs_run_delayed_refs+0x1117/0x1290 [btrfs] age=3141 cpu=6 pid=1729803 kmem_cache_free+0x34c/0x3c0 __btrfs_run_delayed_refs+0x1117/0x1290 [btrfs] btrfs_run_delayed_refs+0x81/0x210 [btrfs] btrfs_write_dirty_block_groups+0x17d/0x3d0 [btrfs] commit_cowonly_roots+0x248/0x300 [btrfs] btrfs_commit_transaction+0x367/0xc40 [btrfs] close_ctree+0x113/0x2fa [btrfs] generic_shutdown_super+0x6c/0x100 kill_anon_super+0x14/0x30 btrfs_kill_super+0x12/0x20 [btrfs] deactivate_locked_super+0x31/0x70 cleanup_mnt+0x100/0x160 task_work_run+0x68/0xb0 exit_to_user_mode_prepare+0x1bb/0x1c0 syscall_exit_to_user_mode+0x4b/0x260 entry_SYSCALL_64_after_hwframe+0x44/0xa9 kmem_cache_destroy btrfs_delayed_ref_head: Slab cache still has objects CPU: 5 PID: 1729921 Comm: rmmod Tainted: G B W 5.10.0-rc4-btrfs-next-73 #1 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.13.0-0-gf21b5a4aeb02-prebuilt.qemu.org 04/01/2014 Call Trace: dump_stack+0x8d/0xb5 kmem_cache_destroy+0x119/0x120 btrfs_delayed_ref_exit+0x11/0x35 [btrfs] exit_btrfs_fs+0xa/0x59 [btrfs] __x64_sys_delete_module+0x194/0x260 ? fpregs_assert_state_consistent+0x1e/0x40 ? exit_to_user_mode_prepare+0x55/0x1c0 ? trace_hardirqs_on+0x1b/0xf0 do_syscall_64+0x33/0x80 entry_SYSCALL_64_after_hwframe+0x44/0xa9 RIP: 0033:0x7f693e305897 Code: 73 01 c3 48 8b 0d f9 f5 0b (...) RSP: 002b:00007ffcf73eb508 EFLAGS: 00000206 ORIG_RAX: 00000000000000b0 RAX: ffffffffffffffda RBX: 0000559df504f760 RCX: 00007f693e305897 RDX: 000000000000000a RSI: 0000000000000800 RDI: 0000559df504f7c8 RBP: 00007ffcf73eb568 R08: 0000000000000000 R09: 0000000000000000 R10: 00007f693e378ac0 R11: 0000000000000206 R12: 00007ffcf73eb740 R13: 00007ffcf73ec5a6 R14: 0000559df504f2a0 R15: 0000559df504f760 ============================================================================= BUG btrfs_delayed_tree_ref (Tainted: G B W ): Objects remaining in btrfs_delayed_tree_ref on __kmem_cache_shutdown() ----------------------------------------------------------------------------- INFO: Slab 0x0000000011f78dc0 objects=37 used=2 fp=0x0000000032d55d91 flags=0x17fffc000010200 CPU: 3 PID: 1729921 Comm: rmmod Tainted: G B W 5.10.0-rc4-btrfs-next-73 #1 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.13.0-0-gf21b5a4aeb02-prebuilt.qemu.org 04/01/2014 Call Trace: dump_stack+0x8d/0xb5 slab_err+0xb7/0xdc ? lock_acquired+0x199/0x490 __kmem_cache_shutdown+0x1ac/0x3c0 ? lock_release+0x20e/0x4c0 kmem_cache_destroy+0x55/0x120 btrfs_delayed_ref_exit+0x1d/0x35 [btrfs] exit_btrfs_fs+0xa/0x59 [btrfs] __x64_sys_delete_module+0x194/0x260 ? fpregs_assert_state_consistent+0x1e/0x40 ? exit_to_user_mode_prepare+0x55/0x1c0 ? trace_hardirqs_on+0x1b/0xf0 do_syscall_64+0x33/0x80 entry_SYSCALL_64_after_hwframe+0x44/0xa9 RIP: 0033:0x7f693e305897 Code: 73 01 c3 48 8b 0d f9 f5 (...) RSP: 002b:00007ffcf73eb508 EFLAGS: 00000206 ORIG_RAX: 00000000000000b0 RAX: ffffffffffffffda RBX: 0000559df504f760 RCX: 00007f693e305897 RDX: 000000000000000a RSI: 0000000000000800 RDI: 0000559df504f7c8 RBP: 00007ffcf73eb568 R08: 0000000000000000 R09: 0000000000000000 R10: 00007f693e378ac0 R11: 0000000000000206 R12: 00007ffcf73eb740 R13: 00007ffcf73ec5a6 R14: 0000559df504f2a0 R15: 0000559df504f760 INFO: Object 0x000000001a340018 @offset=4408 INFO: Allocated in btrfs_add_delayed_tree_ref+0x9e/0x480 [btrfs] age=1917 cpu=6 pid=1729873 __slab_alloc.isra.0+0x109/0x1c0 kmem_cache_alloc+0x7bb/0x830 btrfs_add_delayed_tree_ref+0x9e/0x480 [btrfs] btrfs_free_tree_block+0x128/0x360 [btrfs] __btrfs_cow_block+0x489/0x5f0 [btrfs] btrfs_cow_block+0xf7/0x220 [btrfs] btrfs_search_slot+0x62a/0xc40 [btrfs] btrfs_del_orphan_item+0x65/0xd0 [btrfs] btrfs_find_orphan_roots+0x1bf/0x200 [btrfs] open_ctree+0x125a/0x18a0 [btrfs] btrfs_mount_root.cold+0x13/0xed [btrfs] legacy_get_tree+0x30/0x60 vfs_get_tree+0x28/0xe0 fc_mount+0xe/0x40 vfs_kern_mount.part.0+0x71/0x90 btrfs_mount+0x13b/0x3e0 [btrfs] INFO: Freed in __btrfs_run_delayed_refs+0x63d/0x1290 [btrfs] age=4167 cpu=4 pid=1729795 kmem_cache_free+0x34c/0x3c0 __btrfs_run_delayed_refs+0x63d/0x1290 [btrfs] btrfs_run_delayed_refs+0x81/0x210 [btrfs] btrfs_commit_transaction+0x60/0xc40 [btrfs] create_subvol+0x56a/0x990 [btrfs] btrfs_mksubvol+0x3fb/0x4a0 [btrfs] __btrfs_ioctl_snap_create+0x119/0x1a0 [btrfs] btrfs_ioctl_snap_create+0x58/0x80 [btrfs] btrfs_ioctl+0x1a92/0x36f0 [btrfs] __x64_sys_ioctl+0x83/0xb0 do_syscall_64+0x33/0x80 entry_SYSCALL_64_after_hwframe+0x44/0xa9 INFO: Object 0x000000002b46292a @offset=13648 INFO: Allocated in btrfs_add_delayed_tree_ref+0x9e/0x480 [btrfs] age=1923 cpu=6 pid=1729873 __slab_alloc.isra.0+0x109/0x1c0 kmem_cache_alloc+0x7bb/0x830 btrfs_add_delayed_tree_ref+0x9e/0x480 [btrfs] btrfs_alloc_tree_block+0x2bf/0x360 [btrfs] alloc_tree_block_no_bg_flush+0x4f/0x60 [btrfs] __btrfs_cow_block+0x12d/0x5f0 [btrfs] btrfs_cow_block+0xf7/0x220 [btrfs] btrfs_search_slot+0x62a/0xc40 [btrfs] btrfs_del_orphan_item+0x65/0xd0 [btrfs] btrfs_find_orphan_roots+0x1bf/0x200 [btrfs] open_ctree+0x125a/0x18a0 [btrfs] btrfs_mount_root.cold+0x13/0xed [btrfs] legacy_get_tree+0x30/0x60 vfs_get_tree+0x28/0xe0 fc_mount+0xe/0x40 vfs_kern_mount.part.0+0x71/0x90 INFO: Freed in __btrfs_run_delayed_refs+0x63d/0x1290 [btrfs] age=3164 cpu=6 pid=1729803 kmem_cache_free+0x34c/0x3c0 __btrfs_run_delayed_refs+0x63d/0x1290 [btrfs] btrfs_run_delayed_refs+0x81/0x210 [btrfs] commit_cowonly_roots+0xfb/0x300 [btrfs] btrfs_commit_transaction+0x367/0xc40 [btrfs] close_ctree+0x113/0x2fa [btrfs] generic_shutdown_super+0x6c/0x100 kill_anon_super+0x14/0x30 btrfs_kill_super+0x12/0x20 [btrfs] deactivate_locked_super+0x31/0x70 cleanup_mnt+0x100/0x160 task_work_run+0x68/0xb0 exit_to_user_mode_prepare+0x1bb/0x1c0 syscall_exit_to_user_mode+0x4b/0x260 entry_SYSCALL_64_after_hwframe+0x44/0xa9 kmem_cache_destroy btrfs_delayed_tree_ref: Slab cache still has objects CPU: 5 PID: 1729921 Comm: rmmod Tainted: G B W 5.10.0-rc4-btrfs-next-73 #1 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.13.0-0-gf21b5a4aeb02-prebuilt.qemu.org 04/01/2014 Call Trace: dump_stack+0x8d/0xb5 kmem_cache_destroy+0x119/0x120 btrfs_delayed_ref_exit+0x1d/0x35 [btrfs] exit_btrfs_fs+0xa/0x59 [btrfs] __x64_sys_delete_module+0x194/0x260 ? fpregs_assert_state_consistent+0x1e/0x40 ? exit_to_user_mode_prepare+0x55/0x1c0 ? trace_hardirqs_on+0x1b/0xf0 do_syscall_64+0x33/0x80 entry_SYSCALL_64_after_hwframe+0x44/0xa9 RIP: 0033:0x7f693e305897 Code: 73 01 c3 48 8b 0d f9 f5 (...) RSP: 002b:00007ffcf73eb508 EFLAGS: 00000206 ORIG_RAX: 00000000000000b0 RAX: ffffffffffffffda RBX: 0000559df504f760 RCX: 00007f693e305897 RDX: 000000000000000a RSI: 0000000000000800 RDI: 0000559df504f7c8 RBP: 00007ffcf73eb568 R08: 0000000000000000 R09: 0000000000000000 R10: 00007f693e378ac0 R11: 0000000000000206 R12: 00007ffcf73eb740 R13: 00007ffcf73ec5a6 R14: 0000559df504f2a0 R15: 0000559df504f760 ============================================================================= BUG btrfs_delayed_extent_op (Tainted: G B W ): Objects remaining in btrfs_delayed_extent_op on __kmem_cache_shutdown() ----------------------------------------------------------------------------- INFO: Slab 0x00000000f145ce2f objects=22 used=1 fp=0x00000000af0f92cf flags=0x17fffc000010200 CPU: 5 PID: 1729921 Comm: rmmod Tainted: G B W 5.10.0-rc4-btrfs-next-73 #1 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.13.0-0-gf21b5a4aeb02-prebuilt.qemu.org 04/01/2014 Call Trace: dump_stack+0x8d/0xb5 slab_err+0xb7/0xdc ? lock_acquired+0x199/0x490 __kmem_cache_shutdown+0x1ac/0x3c0 ? __mutex_unlock_slowpath+0x45/0x2a0 kmem_cache_destroy+0x55/0x120 exit_btrfs_fs+0xa/0x59 [btrfs] __x64_sys_delete_module+0x194/0x260 ? fpregs_assert_state_consistent+0x1e/0x40 ? exit_to_user_mode_prepare+0x55/0x1c0 ? trace_hardirqs_on+0x1b/0xf0 do_syscall_64+0x33/0x80 entry_SYSCALL_64_after_hwframe+0x44/0xa9 RIP: 0033:0x7f693e305897 Code: 73 01 c3 48 8b 0d f9 f5 (...) RSP: 002b:00007ffcf73eb508 EFLAGS: 00000206 ORIG_RAX: 00000000000000b0 RAX: ffffffffffffffda RBX: 0000559df504f760 RCX: 00007f693e305897 RDX: 000000000000000a RSI: 0000000000000800 RDI: 0000559df504f7c8 RBP: 00007ffcf73eb568 R08: 0000000000000000 R09: 0000000000000000 R10: 00007f693e378ac0 R11: 0000000000000206 R12: 00007ffcf73eb740 R13: 00007ffcf73ec5a6 R14: 0000559df504f2a0 R15: 0000559df504f760 INFO: Object 0x000000004cf95ea8 @offset=6264 INFO: Allocated in btrfs_alloc_tree_block+0x1e0/0x360 [btrfs] age=1931 cpu=6 pid=1729873 __slab_alloc.isra.0+0x109/0x1c0 kmem_cache_alloc+0x7bb/0x830 btrfs_alloc_tree_block+0x1e0/0x360 [btrfs] alloc_tree_block_no_bg_flush+0x4f/0x60 [btrfs] __btrfs_cow_block+0x12d/0x5f0 [btrfs] btrfs_cow_block+0xf7/0x220 [btrfs] btrfs_search_slot+0x62a/0xc40 [btrfs] btrfs_del_orphan_item+0x65/0xd0 [btrfs] btrfs_find_orphan_roots+0x1bf/0x200 [btrfs] open_ctree+0x125a/0x18a0 [btrfs] btrfs_mount_root.cold+0x13/0xed [btrfs] legacy_get_tree+0x30/0x60 vfs_get_tree+0x28/0xe0 fc_mount+0xe/0x40 vfs_kern_mount.part.0+0x71/0x90 btrfs_mount+0x13b/0x3e0 [btrfs] INFO: Freed in __btrfs_run_delayed_refs+0xabd/0x1290 [btrfs] age=3173 cpu=6 pid=1729803 kmem_cache_free+0x34c/0x3c0 __btrfs_run_delayed_refs+0xabd/0x1290 [btrfs] btrfs_run_delayed_refs+0x81/0x210 [btrfs] commit_cowonly_roots+0xfb/0x300 [btrfs] btrfs_commit_transaction+0x367/0xc40 [btrfs] close_ctree+0x113/0x2fa [btrfs] generic_shutdown_super+0x6c/0x100 kill_anon_super+0x14/0x30 btrfs_kill_super+0x12/0x20 [btrfs] deactivate_locked_super+0x31/0x70 cleanup_mnt+0x100/0x160 task_work_run+0x68/0xb0 exit_to_user_mode_prepare+0x1bb/0x1c0 syscall_exit_to_user_mode+0x4b/0x260 entry_SYSCALL_64_after_hwframe+0x44/0xa9 kmem_cache_destroy btrfs_delayed_extent_op: Slab cache still has objects CPU: 3 PID: 1729921 Comm: rmmod Tainted: G B W 5.10.0-rc4-btrfs-next-73 #1 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.13.0-0-gf21b5a4aeb02-prebuilt.qemu.org 04/01/2014 Call Trace: dump_stack+0x8d/0xb5 kmem_cache_destroy+0x119/0x120 exit_btrfs_fs+0xa/0x59 [btrfs] __x64_sys_delete_module+0x194/0x260 ? fpregs_assert_state_consistent+0x1e/0x40 ? exit_to_user_mode_prepare+0x55/0x1c0 ? trace_hardirqs_on+0x1b/0xf0 do_syscall_64+0x33/0x80 entry_SYSCALL_64_after_hwframe+0x44/0xa9 RIP: 0033:0x7f693e305897 Code: 73 01 c3 48 8b 0d f9 (...) RSP: 002b:00007ffcf73eb508 EFLAGS: 00000206 ORIG_RAX: 00000000000000b0 RAX: ffffffffffffffda RBX: 0000559df504f760 RCX: 00007f693e305897 RDX: 000000000000000a RSI: 0000000000000800 RDI: 0000559df504f7c8 RBP: 00007ffcf73eb568 R08: 0000000000000000 R09: 0000000000000000 R10: 00007f693e378ac0 R11: 0000000000000206 R12: 00007ffcf73eb740 R13: 00007ffcf73ec5a6 R14: 0000559df504f2a0 R15: 0000559df504f760 BTRFS: state leak: start 30408704 end 30425087 state 1 in tree 1 refs 1 Fix this issue by having the remount path stop the qgroup rescan worker when we are remounting RO and teach the rescan worker to stop when a remount is in progress. If later a remount in RW mode happens, we are already resuming the qgroup rescan worker through the call to btrfs_qgroup_rescan_resume(), so we do not need to worry about that. Tested-by: Fabian Vogt <fvogt@suse.com> Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
|
|
|
8fc058597a |
btrfs: merge critical sections of discard lock in workfn
btrfs_discard_workfn() drops discard_ctl->lock just to take it again in a moment in btrfs_discard_schedule_work(). Avoid that and also reuse ktime. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
|
|
|
1ea2872fc6 |
btrfs: fix racy access to discard_ctl data
Because only one discard worker may be running at any given point, it could have been safe to modify ->prev_discard, etc. without synchronization, if not for @override flag in btrfs_discard_schedule_work() and delayed_work_pending() returning false while workfn is running. That may lead to torn reads of u64 for some architectures, but that's not a big problem as only slightly affects the discard rate. Suggested-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
|
|
|
ea9ed87c73 |
btrfs: fix async discard stall
Might happen that bg->discard_eligible_time was changed without rescheduling, so btrfs_discard_workfn() wakes up earlier than that new time, peek_discard_list() returns NULL, and all work halts and goes to sleep without further rescheduling even there are block groups to discard. It happens pretty often, but not so visible from the userspace because after some time it usually will be kicked off anyway by someone else calling btrfs_discard_reschedule_work(). Fix it by continue rescheduling if block group discard lists are not empty. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: David Sterba <dsterba@suse.com> |
|
|
|
675a4fc8f3 |
btrfs: tests: initialize test inodes location
I noticed that sometimes the module failed to load because the self tests failed like this: BTRFS: selftest: fs/btrfs/tests/inode-tests.c:963 miscount, wanted 1, got 0 This turned out to be because sometimes the btrfs ino would be the btree inode number, and thus we'd skip calling the set extent delalloc bit helper, and thus not adjust ->outstanding_extents. Fix this by making sure we initialize test inodes with a valid inode number so that we don't get random failures during self tests. Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
|
|
|
0b3f407e67 |
btrfs: send: fix wrong file path when there is an inode with a pending rmdir
When doing an incremental send, if we have a new inode that happens to have the same number that an old directory inode had in the base snapshot and that old directory has a pending rmdir operation, we end up computing a wrong path for the new inode, causing the receiver to fail. Example reproducer: $ cat test-send-rmdir.sh #!/bin/bash DEV=/dev/sdi MNT=/mnt/sdi mkfs.btrfs -f $DEV >/dev/null mount $DEV $MNT mkdir $MNT/dir touch $MNT/dir/file1 touch $MNT/dir/file2 touch $MNT/dir/file3 # Filesystem looks like: # # . (ino 256) # |----- dir/ (ino 257) # |----- file1 (ino 258) # |----- file2 (ino 259) # |----- file3 (ino 260) # btrfs subvolume snapshot -r $MNT $MNT/snap1 btrfs send -f /tmp/snap1.send $MNT/snap1 # Now remove our directory and all its files. rm -fr $MNT/dir # Unmount the filesystem and mount it again. This is to ensure that # the next inode that is created ends up with the same inode number # that our directory "dir" had, 257, which is the first free "objectid" # available after mounting again the filesystem. umount $MNT mount $DEV $MNT # Now create a new file (it could be a directory as well). touch $MNT/newfile # Filesystem now looks like: # # . (ino 256) # |----- newfile (ino 257) # btrfs subvolume snapshot -r $MNT $MNT/snap2 btrfs send -f /tmp/snap2.send -p $MNT/snap1 $MNT/snap2 # Now unmount the filesystem, create a new one, mount it and try to apply # both send streams to recreate both snapshots. umount $DEV mkfs.btrfs -f $DEV >/dev/null mount $DEV $MNT btrfs receive -f /tmp/snap1.send $MNT btrfs receive -f /tmp/snap2.send $MNT umount $MNT When running the test, the receive operation for the incremental stream fails: $ ./test-send-rmdir.sh Create a readonly snapshot of '/mnt/sdi' in '/mnt/sdi/snap1' At subvol /mnt/sdi/snap1 Create a readonly snapshot of '/mnt/sdi' in '/mnt/sdi/snap2' At subvol /mnt/sdi/snap2 At subvol snap1 At snapshot snap2 ERROR: chown o257-9-0 failed: No such file or directory So fix this by tracking directories that have a pending rmdir by inode number and generation number, instead of only inode number. A test case for fstests follows soon. Reported-by: Massimo B. <massimo.b@gmx.net> Tested-by: Massimo B. <massimo.b@gmx.net> Link: https://lore.kernel.org/linux-btrfs/6ae34776e85912960a253a8327068a892998e685.camel@gmx.net/ CC: stable@vger.kernel.org # 4.19+ Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
|
|
|
ae5e070eac |
btrfs: qgroup: don't try to wait flushing if we're already holding a transaction
There is a chance of racing for qgroup flushing which may lead to deadlock: Thread A | Thread B (not holding trans handle) | (holding a trans handle) --------------------------------+-------------------------------- __btrfs_qgroup_reserve_meta() | __btrfs_qgroup_reserve_meta() |- try_flush_qgroup() | |- try_flush_qgroup() |- QGROUP_FLUSHING bit set | | | | |- test_and_set_bit() | | |- wait_event() |- btrfs_join_transaction() | |- btrfs_commit_transaction()| !!! DEAD LOCK !!! Since thread A wants to commit transaction, but thread B is holding a transaction handle, blocking the commit. At the same time, thread B is waiting for thread A to finish its commit. This is just a hot fix, and would lead to more EDQUOT when we're near the qgroup limit. The proper fix would be to make all metadata/data reservations happen without holding a transaction handle. CC: stable@vger.kernel.org # 5.9+ Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
|
|
|
9a66497156 |
btrfs: correctly calculate item size used when item key collision happens
Item key collision is allowed for some item types, like dir item and
inode refs, but the overall item size is limited by the nodesize.
item size(ins_len) passed from btrfs_insert_empty_items to
btrfs_search_slot already contains size of btrfs_item.
When btrfs_search_slot reaches leaf, we'll see if we need to split leaf.
The check incorrectly reports that split leaf is required, because
it treats the space required by the newly inserted item as
btrfs_item + item data. But in item key collision case, only item data
is actually needed, the newly inserted item could merge into the existing
one. No new btrfs_item will be inserted.
And split_leaf return EOVERFLOW from following code:
if (extend && data_size + btrfs_item_size_nr(l, slot) +
sizeof(struct btrfs_item) > BTRFS_LEAF_DATA_SIZE(fs_info))
return -EOVERFLOW;
In most cases, when callers receive EOVERFLOW, they either return
this error or handle in different ways. For example, in normal dir item
creation the userspace will get errno EOVERFLOW; in inode ref case
INODE_EXTREF is used instead.
However, this is not the case for rename. To avoid the unrecoverable
situation in rename, btrfs_check_dir_item_collision is called in
early phase of rename. In this function, when item key collision is
detected leaf space is checked:
data_size = sizeof(*di) + name_len;
if (data_size + btrfs_item_size_nr(leaf, slot) +
sizeof(struct btrfs_item) > BTRFS_LEAF_DATA_SIZE(root->fs_info))
the sizeof(struct btrfs_item) + btrfs_item_size_nr(leaf, slot) here
refers to existing item size, the condition here correctly calculates
the needed size for collision case rather than the wrong case above.
The consequence of inconsistent condition check between
btrfs_check_dir_item_collision and btrfs_search_slot when item key
collision happens is that we might pass check here but fail
later at btrfs_search_slot. Rename fails and volume is forced readonly
[436149.586170] ------------[ cut here ]------------
[436149.586173] BTRFS: Transaction aborted (error -75)
[436149.586196] WARNING: CPU: 0 PID: 16733 at fs/btrfs/inode.c:9870 btrfs_rename2+0x1938/0x1b70 [btrfs]
[436149.586227] CPU: 0 PID: 16733 Comm: python Tainted: G D 4.18.0-rc5+ #1
[436149.586228] Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 04/05/2016
[436149.586238] RIP: 0010:btrfs_rename2+0x1938/0x1b70 [btrfs]
[436149.586254] RSP: 0018:ffffa327043a7ce0 EFLAGS: 00010286
[436149.586255] RAX: 0000000000000000 RBX: ffff8d8a17d13340 RCX: 0000000000000006
[436149.586256] RDX: 0000000000000007 RSI: 0000000000000096 RDI: ffff8d8a7fc164b0
[436149.586257] RBP: ffffa327043a7da0 R08: 0000000000000560 R09: 7265282064657472
[436149.586258] R10: 0000000000000000 R11: 6361736e61725420 R12: ffff8d8a0d4c8b08
[436149.586258] R13: ffff8d8a17d13340 R14: ffff8d8a33e0a540 R15: 00000000000001fe
[436149.586260] FS: 00007fa313933740(0000) GS:ffff8d8a7fc00000(0000) knlGS:0000000000000000
[436149.586261] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[436149.586262] CR2: 000055d8d9c9a720 CR3: 000000007aae0003 CR4: 00000000003606f0
[436149.586295] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[436149.586296] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[436149.586296] Call Trace:
[436149.586311] vfs_rename+0x383/0x920
[436149.586313] ? vfs_rename+0x383/0x920
[436149.586315] do_renameat2+0x4ca/0x590
[436149.586317] __x64_sys_rename+0x20/0x30
[436149.586324] do_syscall_64+0x5a/0x120
[436149.586330] entry_SYSCALL_64_after_hwframe+0x44/0xa9
[436149.586332] RIP: 0033:0x7fa3133b1d37
[436149.586348] RSP: 002b:00007fffd3e43908 EFLAGS: 00000246 ORIG_RAX: 0000000000000052
[436149.586349] RAX: ffffffffffffffda RBX: 00007fa3133b1d30 RCX: 00007fa3133b1d37
[436149.586350] RDX: 000055d8da06b5e0 RSI: 000055d8da225d60 RDI: 000055d8da2c4da0
[436149.586351] RBP: 000055d8da2252f0 R08: 00007fa313782000 R09: 00000000000177e0
[436149.586351] R10: 000055d8da010680 R11: 0000000000000246 R12: 00007fa313840b00
Thanks to Hans van Kranenburg for information about crc32 hash collision
tools, I was able to reproduce the dir item collision with following
python script.
https://github.com/wutzuchieh/misc_tools/blob/master/crc32_forge.py Run
it under a btrfs volume will trigger the abort transaction. It simply
creates files and rename them to forged names that leads to
hash collision.
There are two ways to fix this. One is to simply revert the patch
|
|
|
|
3d45f221ce |
btrfs: fix deadlock when cloning inline extent and low on free metadata space
When cloning an inline extent there are cases where we can not just copy
the inline extent from the source range to the target range (e.g. when the
target range starts at an offset greater than zero). In such cases we copy
the inline extent's data into a page of the destination inode and then
dirty that page. However, after that we will need to start a transaction
for each processed extent and, if we are ever low on available metadata
space, we may need to flush existing delalloc for all dirty inodes in an
attempt to release metadata space - if that happens we may deadlock:
* the async reclaim task queued a delalloc work to flush delalloc for
the destination inode of the clone operation;
* the task executing that delalloc work gets blocked waiting for the
range with the dirty page to be unlocked, which is currently locked
by the task doing the clone operation;
* the async reclaim task blocks waiting for the delalloc work to complete;
* the cloning task is waiting on the waitqueue of its reservation ticket
while holding the range with the dirty page locked in the inode's
io_tree;
* if metadata space is not released by some other task (like delalloc for
some other inode completing for example), the clone task waits forever
and as a consequence the delalloc work and async reclaim tasks will hang
forever as well. Releasing more space on the other hand may require
starting a transaction, which will hang as well when trying to reserve
metadata space, resulting in a deadlock between all these tasks.
When this happens, traces like the following show up in dmesg/syslog:
[87452.323003] INFO: task kworker/u16:11:1810830 blocked for more than 120 seconds.
[87452.323644] Tainted: G B W 5.10.0-rc4-btrfs-next-73 #1
[87452.324248] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[87452.324852] task:kworker/u16:11 state:D stack: 0 pid:1810830 ppid: 2 flags:0x00004000
[87452.325520] Workqueue: btrfs-flush_delalloc btrfs_work_helper [btrfs]
[87452.326136] Call Trace:
[87452.326737] __schedule+0x5d1/0xcf0
[87452.327390] schedule+0x45/0xe0
[87452.328174] lock_extent_bits+0x1e6/0x2d0 [btrfs]
[87452.328894] ? finish_wait+0x90/0x90
[87452.329474] btrfs_invalidatepage+0x32c/0x390 [btrfs]
[87452.330133] ? __mod_memcg_state+0x8e/0x160
[87452.330738] __extent_writepage+0x2d4/0x400 [btrfs]
[87452.331405] extent_write_cache_pages+0x2b2/0x500 [btrfs]
[87452.332007] ? lock_release+0x20e/0x4c0
[87452.332557] ? trace_hardirqs_on+0x1b/0xf0
[87452.333127] extent_writepages+0x43/0x90 [btrfs]
[87452.333653] ? lock_acquire+0x1a3/0x490
[87452.334177] do_writepages+0x43/0xe0
[87452.334699] ? __filemap_fdatawrite_range+0xa4/0x100
[87452.335720] __filemap_fdatawrite_range+0xc5/0x100
[87452.336500] btrfs_run_delalloc_work+0x17/0x40 [btrfs]
[87452.337216] btrfs_work_helper+0xf1/0x600 [btrfs]
[87452.337838] process_one_work+0x24e/0x5e0
[87452.338437] worker_thread+0x50/0x3b0
[87452.339137] ? process_one_work+0x5e0/0x5e0
[87452.339884] kthread+0x153/0x170
[87452.340507] ? kthread_mod_delayed_work+0xc0/0xc0
[87452.341153] ret_from_fork+0x22/0x30
[87452.341806] INFO: task kworker/u16:1:2426217 blocked for more than 120 seconds.
[87452.342487] Tainted: G B W 5.10.0-rc4-btrfs-next-73 #1
[87452.343274] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[87452.344049] task:kworker/u16:1 state:D stack: 0 pid:2426217 ppid: 2 flags:0x00004000
[87452.344974] Workqueue: events_unbound btrfs_async_reclaim_metadata_space [btrfs]
[87452.345655] Call Trace:
[87452.346305] __schedule+0x5d1/0xcf0
[87452.346947] ? kvm_clock_read+0x14/0x30
[87452.347676] ? wait_for_completion+0x81/0x110
[87452.348389] schedule+0x45/0xe0
[87452.349077] schedule_timeout+0x30c/0x580
[87452.349718] ? _raw_spin_unlock_irqrestore+0x3c/0x60
[87452.350340] ? lock_acquire+0x1a3/0x490
[87452.351006] ? try_to_wake_up+0x7a/0xa20
[87452.351541] ? lock_release+0x20e/0x4c0
[87452.352040] ? lock_acquired+0x199/0x490
[87452.352517] ? wait_for_completion+0x81/0x110
[87452.353000] wait_for_completion+0xab/0x110
[87452.353490] start_delalloc_inodes+0x2af/0x390 [btrfs]
[87452.353973] btrfs_start_delalloc_roots+0x12d/0x250 [btrfs]
[87452.354455] flush_space+0x24f/0x660 [btrfs]
[87452.355063] btrfs_async_reclaim_metadata_space+0x1bb/0x480 [btrfs]
[87452.355565] process_one_work+0x24e/0x5e0
[87452.356024] worker_thread+0x20f/0x3b0
[87452.356487] ? process_one_work+0x5e0/0x5e0
[87452.356973] kthread+0x153/0x170
[87452.357434] ? kthread_mod_delayed_work+0xc0/0xc0
[87452.357880] ret_from_fork+0x22/0x30
(...)
< stack traces of several tasks waiting for the locks of the inodes of the
clone operation >
(...)
[92867.444138] RSP: 002b:00007ffc3371bbe8 EFLAGS: 00000246 ORIG_RAX: 0000000000000052
[92867.444624] RAX: ffffffffffffffda RBX: 00007ffc3371bea0 RCX: 00007f61efe73f97
[92867.445116] RDX: 0000000000000000 RSI: 0000560fbd5d7a40 RDI: 0000560fbd5d8960
[92867.445595] RBP: 00007ffc3371beb0 R08: 0000000000000001 R09: 0000000000000003
[92867.446070] R10: 00007ffc3371b996 R11: 0000000000000246 R12: 0000000000000000
[92867.446820] R13: 000000000000001f R14: 00007ffc3371bea0 R15: 00007ffc3371beb0
[92867.447361] task:fsstress state:D stack: 0 pid:2508238 ppid:2508153 flags:0x00004000
[92867.447920] Call Trace:
[92867.448435] __schedule+0x5d1/0xcf0
[92867.448934] ? _raw_spin_unlock_irqrestore+0x3c/0x60
[92867.449423] schedule+0x45/0xe0
[92867.449916] __reserve_bytes+0x4a4/0xb10 [btrfs]
[92867.450576] ? finish_wait+0x90/0x90
[92867.451202] btrfs_reserve_metadata_bytes+0x29/0x190 [btrfs]
[92867.451815] btrfs_block_rsv_add+0x1f/0x50 [btrfs]
[92867.452412] start_transaction+0x2d1/0x760 [btrfs]
[92867.453216] clone_copy_inline_extent+0x333/0x490 [btrfs]
[92867.453848] ? lock_release+0x20e/0x4c0
[92867.454539] ? btrfs_search_slot+0x9a7/0xc30 [btrfs]
[92867.455218] btrfs_clone+0x569/0x7e0 [btrfs]
[92867.455952] btrfs_clone_files+0xf6/0x150 [btrfs]
[92867.456588] btrfs_remap_file_range+0x324/0x3d0 [btrfs]
[92867.457213] do_clone_file_range+0xd4/0x1f0
[92867.457828] vfs_clone_file_range+0x4d/0x230
[92867.458355] ? lock_release+0x20e/0x4c0
[92867.458890] ioctl_file_clone+0x8f/0xc0
[92867.459377] do_vfs_ioctl+0x342/0x750
[92867.459913] __x64_sys_ioctl+0x62/0xb0
[92867.460377] do_syscall_64+0x33/0x80
[92867.460842] entry_SYSCALL_64_after_hwframe+0x44/0xa9
(...)
< stack traces of more tasks blocked on metadata reservation like the clone
task above, because the async reclaim task has deadlocked >
(...)
Another thing to notice is that the worker task that is deadlocked when
trying to flush the destination inode of the clone operation is at
btrfs_invalidatepage(). This is simply because the clone operation has a
destination offset greater than the i_size and we only update the i_size
of the destination file after cloning an extent (just like we do in the
buffered write path).
Since the async reclaim path uses btrfs_start_delalloc_roots() to trigger
the flushing of delalloc for all inodes that have delalloc, add a runtime
flag to an inode to signal it should not be flushed, and for inodes with
that flag set, start_delalloc_inodes() will simply skip them. When the
cloning code needs to dirty a page to copy an inline extent, set that flag
on the inode and then clear it when the clone operation finishes.
This could be sporadically triggered with test case generic/269 from
fstests, which exercises many fsstress processes running in parallel with
several dd processes filling up the entire filesystem.
CC: stable@vger.kernel.org # 5.9+
Fixes:
|
|
|
|
ac7ac4618c |
for-5.11/block-2020-12-14
-----BEGIN PGP SIGNATURE-----
iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAl/Xec8QHGF4Ym9lQGtl
cm5lbC5kawAKCRD301j7KXHgpoLbEACzXypgZWwMdfgRckA/Vt333rXHtbhUV+hK
2XP+P81iRvr9Esi31UPbRp82vrgcDO0cpI1QmQojS5U5TIQP88BfXptfRZZu48eb
wT5RDDNQ34HItqAh/yEuYsv9yUKcxeIrB99tBVvM+4UmQg9zTdIW3mg6PvCBdbhV
N38jI0tCF/PJatjfRuphT/nXonQLPWBlVDmZk06KZQFOwQe9ep1vUi1+nbiRPuo3
geFBpTh1Kp6Vl1B3n4RpECs6Y7I0RRuJdaH2sDizICla1/BW91F9fQwHimNnUxUq
e1Q1kMuh6ftcQGkYlHSYcPhuv6CvorldTZCO5arPxWpcwvxriTSMRPWAgUr5pEiF
fhiGhqeDu9e6vl9vS31wUD1B30hy+jFz9wyjRrDwJ3cPHH1JVBjTzvdX+cIh/1ku
IbIwUMteUtvUrzqAv/DzbGhedp7xWtOFaVo8j0QFYh9zkjd6b8yDOF/yztwX2gjY
Xt1cd+KpDSiN449ZRaoMI0sCJAxqzhMa6nsWlb0L7KuNyWKAbvKQBm9Rb47FLV9A
Vx70KC+zkFoyw23capvIahmQazerriUJ5PGe0lVm6ROgmIFdCpXTPDjnrvq/6RZ/
GEpD7gTW9atGJ7EuEE8686sAfKD5kneChWLX5EHXf0d0AG5Mr2lKsluiGp5LpPJg
Q1Xqs6xwww==
=zo4w
-----END PGP SIGNATURE-----
Merge tag 'for-5.11/block-2020-12-14' of git://git.kernel.dk/linux-block
Pull block updates from Jens Axboe:
"Another series of killing more code than what is being added, again
thanks to Christoph's relentless cleanups and tech debt tackling.
This contains:
- blk-iocost improvements (Baolin Wang)
- part0 iostat fix (Jeffle Xu)
- Disable iopoll for split bios (Jeffle Xu)
- block tracepoint cleanups (Christoph Hellwig)
- Merging of struct block_device and hd_struct (Christoph Hellwig)
- Rework/cleanup of how block device sizes are updated (Christoph
Hellwig)
- Simplification of gendisk lookup and removal of block device
aliasing (Christoph Hellwig)
- Block device ioctl cleanups (Christoph Hellwig)
- Removal of bdget()/blkdev_get() as exported API (Christoph Hellwig)
- Disk change rework, avoid ->revalidate_disk() (Christoph Hellwig)
- sbitmap improvements (Pavel Begunkov)
- Hybrid polling fix (Pavel Begunkov)
- bvec iteration improvements (Pavel Begunkov)
- Zone revalidation fixes (Damien Le Moal)
- blk-throttle limit fix (Yu Kuai)
- Various little fixes"
* tag 'for-5.11/block-2020-12-14' of git://git.kernel.dk/linux-block: (126 commits)
blk-mq: fix msec comment from micro to milli seconds
blk-mq: update arg in comment of blk_mq_map_queue
blk-mq: add helper allocating tagset->tags
Revert "block: Fix a lockdep complaint triggered by request queue flushing"
nvme-loop: use blk_mq_hctx_set_fq_lock_class to set loop's lock class
blk-mq: add new API of blk_mq_hctx_set_fq_lock_class
block: disable iopoll for split bio
block: Improve blk_revalidate_disk_zones() checks
sbitmap: simplify wrap check
sbitmap: replace CAS with atomic and
sbitmap: remove swap_lock
sbitmap: optimise sbitmap_deferred_clear()
blk-mq: skip hybrid polling if iopoll doesn't spin
blk-iocost: Factor out the base vrate change into a separate function
blk-iocost: Factor out the active iocgs' state check into a separate function
blk-iocost: Move the usage ratio calculation to the correct place
blk-iocost: Remove unnecessary advance declaration
blk-iocost: Fix some typos in comments
blktrace: fix up a kerneldoc comment
block: remove the request_queue to argument request based tracepoints
...
|
|
|
|
f1ee3b8829 |
for-5.11-tag
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEE8rQSAMVO+zA4DBdWxWXV+ddtWDsFAl/XdB4ACgkQxWXV+ddt
WDv41g//dOkrwjAVBfDUwRT/yKqojyEsZB1aNyHlPHFw8KEw5oIW7wxR4oqXi2ed
/i9KIJe4E9AfqAiexhLvA+Wyt/Sgwz+k4ys82PKhhRNQn7LE4tvhSBUu6JYJDU09
6I1jagya7ILa8akFXZTmVbXdliI4Ab+pcXWAmQYK/xPVDxYTSsBf4o4MilNBA9FS
lTwwBh5GTEtIkubr2yVd3pKfF4fT2g1hd+yglpHaOzpcrLMNN4hj4sUFlLbx/FlJ
MWo+914cSNKJoebbnqhK9djD9hggaaXnNooqfBOXUhZN0VN9rQoKb5tW+TREQmFm
shrmBSqN7CaqKfSOMZs7WOnTuTvmV/825PnLqDqcTUaLw+BgdyacpO9WflgfSs16
Cdvagr1SqbrSQ/3WYCpbqPLDNP3XuZ6+m5OWizf6fhyo8xdFcUHZgRC8qejDlycy
V/zP0c5OYOMi5vo6x/zhrD7Uft7xoFUVcSJCe8WPri082d9LbA2BqwCsullD60PQ
K/fsmlHs5Uxxy3MFgBPVDdWGgaa9rQ2vXequezbozBIIeeVL+Q9zkeyBFSYuFeE8
HToRE9B9BUEUh+p1JxPjOdFH/m+sKe1WMdmRLQthMzfOiNWW7pp/nL5rl4BUVmjm
58dQS73Cj/YNdBomRJXPPtgKIJPAWRrzU/JBcwAdMoKy57oh9NQ=
=5YAS
-----END PGP SIGNATURE-----
Merge tag 'for-5.11-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux
Pull btrfs updates from David Sterba:
"We have a mix of all kinds of changes, feature updates, core stuff,
performance improvements and lots of cleanups and preparatory changes.
User visible:
- export filesystem generation in sysfs
- new features for mount option 'rescue':
- what's currently supported is exported in sysfs
- 'ignorebadroots'/'ibadroots' - continue even if some essential
tree roots are not usable (extent, uuid, data reloc, device,
csum, free space)
- 'ignoredatacsums'/'idatacsums' - skip checksum verification on
data
- 'all' - now enables 'ignorebadroots' + 'ignoredatacsums' +
'nologreplay'
- export read mirror policy settings to sysfs, new policies will be
added in the future
- remove inode number cache feature (mount -o inode_cache), obsoleted
in 5.9
User visible fixes:
- async discard scheduling fixes on high loads
- update inode byte counter atomically so stat() does not report
wrong value in some cases
- free space tree fixes:
- correctly report status of v2 after remount
- clear v1 cache inodes when v2 is newly enabled after remount
Core:
- switch own tree lock implementation to standard rw semaphore:
- one-level lock nesting is not required anymore, the last use of
this was in free space that's now loaded asynchronously
- own implementation of adaptive spinning before taking mutex has
been part of rwsem
- performance seems to be better in general, much better (+tens
of percents) for some workloads
- lockdep does not complain
- finish direct IO conversion to iomap infrastructure, remove
temporary workaround for DSYNC after iomap API updates
- preparatory work to support data and metadata blocks smaller than
page:
- generalize code that assumes sectorsize == PAGE_SIZE, lots of
refactoring
- planned namely for 64K pages (eg. arm64, ppc64)
- scrub read-only support
- preparatory work for zoned allocation mode (SMR/ZBC/ZNS friendly):
- disable incompatible features
- round-robin superblock write
- free space cache (v1) is loaded asynchronously, remove tree path
recursion
- slightly improved time tacking for transaction kthread wake ups
Performance improvements (note that the numbers depend on load type or
other features and weren't run on the same machine):
- skip unnecessary work:
- do not start readahead for csum tree when scrubbing non-data
block groups
- do not start and wait for delalloc on snapshot roots on
transaction commit
- fix race when defragmenting leads to unnecessary IO
- dbench speedups (+throughput%/-max latency%):
- skip unnecessary searches for xattrs when logging an inode
(+10.8/-8.2)
- stop incrementing log batch when joining log transaction (1-2)
- unlock path before checking if extent is shared during nocow
writeback (+5.0/-20.5), on fio load +9.7% throughput/-9.8%
runtime
- several tree log improvements, eg. removing unnecessary
operations, fixing races that lead to additional work
(+12.7/-8.2)
- tree-checker error branches annotated with unlikely() (+3%
throughput)
Other:
- cleanups
- lockdep fixes
- more btrfs_inode conversions
- error variable cleanups"
* tag 'for-5.11-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux: (198 commits)
btrfs: scrub: allow scrub to work with subpage sectorsize
btrfs: scrub: support subpage data scrub
btrfs: scrub: support subpage tree block scrub
btrfs: scrub: always allocate one full page for one sector for RAID56
btrfs: scrub: reduce width of extent_len/stripe_len from 64 to 32 bits
btrfs: refactor btrfs_lookup_bio_sums to handle out-of-order bvecs
btrfs: remove btrfs_find_ordered_sum call from btrfs_lookup_bio_sums
btrfs: handle sectorsize < PAGE_SIZE case for extent buffer accessors
btrfs: update num_extent_pages to support subpage sized extent buffer
btrfs: don't allow tree block to cross page boundary for subpage support
btrfs: calculate inline extent buffer page size based on page size
btrfs: factor out btree page submission code to a helper
btrfs: make btrfs_verify_data_csum follow sector size
btrfs: pass bio_offset to check_data_csum() directly
btrfs: rename bio_offset of extent_submit_bio_start_t to dio_file_offset
btrfs: fix lockdep warning when creating free space tree
btrfs: skip space_cache v1 setup when not using it
btrfs: remove free space items when disabling space cache v1
btrfs: warn when remount will not change the free space tree
btrfs: use superblock state to print space_cache mount option
...
|
|
|
|
edd7ab7684 |
The new preemtible kmap_local() implementation:
- Consolidate all kmap_atomic() internals into a generic implementation
which builds the base for the kmap_local() API and make the
kmap_atomic() interface wrappers which handle the disabling/enabling of
preemption and pagefaults.
- Switch the storage from per-CPU to per task and provide scheduler
support for clearing mapping when scheduling out and restoring them
when scheduling back in.
- Merge the migrate_disable/enable() code, which is also part of the
scheduler pull request. This was required to make the kmap_local()
interface available which does not disable preemption when a mapping
is established. It has to disable migration instead to guarantee that
the virtual address of the mapped slot is the same accross preemption.
- Provide better debug facilities: guard pages and enforced utilization
of the mapping mechanics on 64bit systems when the architecture allows
it.
- Provide the new kmap_local() API which can now be used to cleanup the
kmap_atomic() usage sites all over the place. Most of the usage sites
do not require the implicit disabling of preemption and pagefaults so
the penalty on 64bit and 32bit non-highmem systems is removed and quite
some of the code can be simplified. A wholesale conversion is not
possible because some usage depends on the implicit side effects and
some need to be cleaned up because they work around these side effects.
The migrate disable side effect is only effective on highmem systems
and when enforced debugging is enabled. On 64bit and 32bit non-highmem
systems the overhead is completely avoided.
-----BEGIN PGP SIGNATURE-----
iQJHBAABCgAxFiEEQp8+kY+LLUocC4bMphj1TA10mKEFAl/XyQwTHHRnbHhAbGlu
dXRyb25peC5kZQAKCRCmGPVMDXSYoUolD/9+R+BX96fGir+I8rG9dc3cbLw5meSi
0I/Nq3PToZMs2Iqv50DsoaPYHHz/M6fcAO9LRIgsE9jRbnY93GnsBM0wU9Y8yQaT
4wUzOG5WHaLDfqIkx/CN9coUl458oEiwOEbn79A2FmPXFzr7IpkufnV3ybGDwzwP
p73bjMJMPPFrsa9ig87YiYfV/5IAZHi82PN8Cq1v4yNzgXRP3Tg6QoAuCO84ZnWF
RYlrfKjcJ2xPdn+RuYyXolPtxr1hJQ0bOUpe4xu/UfeZjxZ7i1wtwLN9kWZe8CKH
+x4Lz8HZZ5QMTQ9sCHOLtKzu2MceMcpISzoQH4/aFQCNMgLn1zLbS790XkYiQCuR
ne9Cua+IqgYfGMG8cq8+bkU9HCNKaXqIBgPEKE/iHYVmqzCOqhW5Cogu4KFekf6V
Wi7pyyUdX2en8BAWpk5NHc8de9cGcc+HXMq2NIcgXjVWvPaqRP6DeITERTZLJOmz
XPxq5oPLGl7wdm7z+ICIaNApy8zuxpzb6sPLNcn7l5OeorViORlUu08AN8587wAj
FiVjp6ZYomg+gyMkiNkDqFOGDH5TMENpOFoB0hNNEyJwwS0xh6CgWuwZcv+N8aPO
HuS/P+tNANbD8ggT4UparXYce7YCtgOf3IG4GA3JJYvYmJ6pU+AZOWRoDScWq4o+
+jlfoJhMbtx5Gg==
=n71I
-----END PGP SIGNATURE-----
Merge tag 'core-mm-2020-12-14' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull kmap updates from Thomas Gleixner:
"The new preemtible kmap_local() implementation:
- Consolidate all kmap_atomic() internals into a generic
implementation which builds the base for the kmap_local() API and
make the kmap_atomic() interface wrappers which handle the
disabling/enabling of preemption and pagefaults.
- Switch the storage from per-CPU to per task and provide scheduler
support for clearing mapping when scheduling out and restoring them
when scheduling back in.
- Merge the migrate_disable/enable() code, which is also part of the
scheduler pull request. This was required to make the kmap_local()
interface available which does not disable preemption when a
mapping is established. It has to disable migration instead to
guarantee that the virtual address of the mapped slot is the same
across preemption.
- Provide better debug facilities: guard pages and enforced
utilization of the mapping mechanics on 64bit systems when the
architecture allows it.
- Provide the new kmap_local() API which can now be used to cleanup
the kmap_atomic() usage sites all over the place. Most of the usage
sites do not require the implicit disabling of preemption and
pagefaults so the penalty on 64bit and 32bit non-highmem systems is
removed and quite some of the code can be simplified. A wholesale
conversion is not possible because some usage depends on the
implicit side effects and some need to be cleaned up because they
work around these side effects.
The migrate disable side effect is only effective on highmem
systems and when enforced debugging is enabled. On 64bit and 32bit
non-highmem systems the overhead is completely avoided"
* tag 'core-mm-2020-12-14' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (33 commits)
ARM: highmem: Fix cache_is_vivt() reference
x86/crashdump/32: Simplify copy_oldmem_page()
io-mapping: Provide iomap_local variant
mm/highmem: Provide kmap_local*
sched: highmem: Store local kmaps in task struct
x86: Support kmap_local() forced debugging
mm/highmem: Provide CONFIG_DEBUG_KMAP_LOCAL_FORCE_MAP
mm/highmem: Provide and use CONFIG_DEBUG_KMAP_LOCAL
microblaze/mm/highmem: Add dropped #ifdef back
xtensa/mm/highmem: Make generic kmap_atomic() work correctly
mm/highmem: Take kmap_high_get() properly into account
highmem: High implementation details and document API
Documentation/io-mapping: Remove outdated blurb
io-mapping: Cleanup atomic iomap
mm/highmem: Remove the old kmap_atomic cruft
highmem: Get rid of kmap_types.h
xtensa/mm/highmem: Switch to generic kmap atomic
sparc/mm/highmem: Switch to generic kmap atomic
powerpc/mm/highmem: Switch to generic kmap atomic
nds32/mm/highmem: Switch to generic kmap atomic
...
|
|
|
|
b42fe98c92 |
btrfs: scrub: allow scrub to work with subpage sectorsize
Since btrfs scrub is utilizing its own infrastructure to submit read/write, scrub is independent from all other routines. This brings one very neat feature, allow us to read 4K data into offset 0 of a 64K page. So is the writeback routine. This makes scrub on subpage sector size much easier to implement, and thanks to previous commits which just changed the implementation to always do scrub based on sector size, now scrub can handle subpage filesystem without any problem. This patch will just remove the restriction on (sectorsize != PAGE_SIZE), to make scrub finally work on subpage filesystems. Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
|
|
|
b29dca44ab |
btrfs: scrub: support subpage data scrub
Btrfs scrub is more flexible than buffered data write path, as we can read an unaligned subpage data into page offset 0. This ability makes subpage support much easier, we just need to check each scrub_page::page_len and ensure we only calculate hash for [0, page_len) of a page. There is a small thing to notice: for subpage case, we still do sector by sector scrub. This means we will submit a read bio for each sector to scrub, resulting in the same amount of read bios, just like on the 4K page systems. This behavior can be considered as a good thing, if we want everything to be the same as 4K page systems. But this also means, we're wasting the possibility to submit larger bio using 64K page size. This is another problem to consider in the future. Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
|
|
|
53f3251d3b |
btrfs: scrub: support subpage tree block scrub
To support subpage tree block scrub, scrub_checksum_tree_block() only needs to learn 2 new tricks: - Follow sector size Now scrub_page only represents one sector, we need to follow it properly. - Run checksum on all sectors Since scrub_page only represents one sector, we need to run checksum on all sectors, not only (nodesize >> PAGE_SIZE). Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
|
|
|
d0a7a9c050 |
btrfs: scrub: always allocate one full page for one sector for RAID56
For scrub_pages() and scrub_pages_for_parity(), we currently allocate one scrub_page structure for one page. This is fine if we only read/write one sector one time. But for cases like scrubbing RAID56, we need to read/write the full stripe, which is in 64K size for now. For subpage size, we will submit the read in just one page, which is normally a good thing, but for RAID56 case, it only expects to see one sector, not the full stripe in its endio function. This could lead to wrong parity checksum for RAID56 on subpage. To make the existing code work well for subpage case, here we take a shortcut by always allocating a full page for one sector. This should provide the base to make RAID56 work for subpage case. The cost is pretty obvious now, for one RAID56 stripe now we always need 16 pages. For support subpage situation (64K page size, 4K sector size), this means we need full one megabyte to scrub just one RAID56 stripe. And for data scrub, each 4K sector will also need one 64K page. This is mostly just a workaround, the proper fix for this is a much larger project, using scrub_block to replace scrub_page, and allow scrub_block to handle multi pages, csums, and csum_bitmap to avoid allocating one page for each sector. Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
|
|
|
fa485d21a7 |
btrfs: scrub: reduce width of extent_len/stripe_len from 64 to 32 bits
Btrfs on-disk format chose to use u64 for almost everything, but there are a other restrictions that won't let us use more than u32 for things like extent length (the maximum length is 128MiB for non-hole extents), or stripe length (we have device number limit). This means if we don't have extra handling to convert u64 to u32, we will always have some questionable operations like "u32 = u64 >> sectorsize_bits" in the code. This patch will try to address the problem by reducing the width for the following members/parameters: - scrub_parity::stripe_len - @len of scrub_pages() - @extent_len of scrub_remap_extent() - @len of scrub_parity_mark_sectors_error() - @len of scrub_parity_mark_sectors_data() - @len of scrub_extent() - @len of scrub_pages_for_parity() - @len of scrub_extent_for_parity() For members extracted from on-disk structure, like map->stripe_len, they will be kept as is. Since that modification would require on-disk format change. There will be cases like "u32 = u64 - u64" or "u32 = u64", for such call sites, extra ASSERT() is added to be extra safe for debug builds. Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
|
|
|
6275193ef1 |
btrfs: refactor btrfs_lookup_bio_sums to handle out-of-order bvecs
Refactor btrfs_lookup_bio_sums() by:
- Remove the @file_offset parameter
There are two factors making the @file_offset parameter useless:
* For csum lookup in csum tree, file offset makes no sense
We only need disk_bytenr, which is unrelated to file_offset
* page_offset (file offset) of each bvec is not contiguous.
Pages can be added to the same bio as long as their on-disk bytenr
is contiguous, meaning we could have pages at different file offsets
in the same bio.
Thus passing file_offset makes no sense any more.
The only user of file_offset is for data reloc inode, we will use
a new function, search_file_offset_in_bio(), to handle it.
- Extract the csum tree lookup into search_csum_tree()
The new function will handle the csum search in csum tree.
The return value is the same as btrfs_find_ordered_sum(), returning
the number of found sectors which have checksum.
- Change how we do the main loop
The only needed info from bio is:
* the on-disk bytenr
* the length
After extracting the above info, we can do the search without bio
at all, which makes the main loop much simpler:
for (cur_disk_bytenr = orig_disk_bytenr;
cur_disk_bytenr < orig_disk_bytenr + orig_len;
cur_disk_bytenr += count * sectorsize) {
/* Lookup csum tree */
count = search_csum_tree(fs_info, path, cur_disk_bytenr,
search_len, csum_dst);
if (!count) {
/* Csum hole handling */
}
}
- Use single variable as the source to calculate all other offsets
Instead of all different type of variables, we use only one main
variable, cur_disk_bytenr, which represents the current disk bytenr.
All involved values can be calculated from that variable, and
all those variable will only be visible in the inner loop.
The above refactoring makes btrfs_lookup_bio_sums() way more robust than
it used to be, especially related to the file offset lookup. Now
file_offset lookup is only related to data reloc inode, otherwise we
don't need to bother file_offset at all.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
|
|
9e46458a7c |
btrfs: remove btrfs_find_ordered_sum call from btrfs_lookup_bio_sums
The function btrfs_lookup_bio_sums() is only called for read bios. While btrfs_find_ordered_sum() is to search ordered extent sums, which is only for write path. This means to read a page we either: - Submit read bio if it's not uptodate This means we only need to search csum tree for checksums. - The page is already uptodate It can be marked uptodate for previous read, or being marked dirty. As we always mark page uptodate for dirty page. In that case, we don't need to submit read bio at all, thus no need to search any checksums. Remove the btrfs_find_ordered_sum() call in btrfs_lookup_bio_sums(). And since btrfs_lookup_bio_sums() is the only caller for btrfs_find_ordered_sum(), also remove the implementation. Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
|
|
|
884b07d0f4 |
btrfs: handle sectorsize < PAGE_SIZE case for extent buffer accessors
To support sectorsize < PAGE_SIZE case, we need to take extra care of extent buffer accessors. Since sectorsize is smaller than PAGE_SIZE, one page can contain multiple tree blocks, we must use eb->start to determine the real offset to read/write for extent buffer accessors. This patch introduces two helpers to do this: - get_eb_page_index() This is to calculate the index to access extent_buffer::pages. It's just a simple wrapper around "start >> PAGE_SHIFT". For sectorsize == PAGE_SIZE case, nothing is changed. For sectorsize < PAGE_SIZE case, we always get index as 0, and the existing page shift also works. - get_eb_offset_in_page() This is to calculate the offset to access extent_buffer::pages. This needs to take extent_buffer::start into consideration. For sectorsize == PAGE_SIZE case, extent_buffer::start is always aligned to PAGE_SIZE, thus adding extent_buffer::start to offset_in_page() won't change the result. For sectorsize < PAGE_SIZE case, adding extent_buffer::start gives us the correct offset to access. This patch will touch the following parts to cover all extent buffer accessors: - BTRFS_SETGET_HEADER_FUNCS() - read_extent_buffer() - read_extent_buffer_to_user() - memcmp_extent_buffer() - write_extent_buffer_chunk_tree_uuid() - write_extent_buffer_fsid() - write_extent_buffer() - memzero_extent_buffer() - copy_extent_buffer_full() - copy_extent_buffer() - memcpy_extent_buffer() - memmove_extent_buffer() - btrfs_get_token_##bits() - btrfs_get_##bits() - btrfs_set_token_##bits() - btrfs_set_##bits() - generic_bin_search() Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
|
|
|
4a3dc93843 |
btrfs: update num_extent_pages to support subpage sized extent buffer
For subpage sized extent buffer, we have ensured no extent buffer will cross page boundary, thus we would only need one page for any extent buffer. Update function num_extent_pages to handle such case. Now num_extent_pages() returns 1 for subpage sized extent buffer. Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
|
|
|
1aaac38c83 |
btrfs: don't allow tree block to cross page boundary for subpage support
As a preparation for subpage sector size support (allowing filesystem with sector size smaller than page size to be mounted) if the sector size is smaller than page size, we don't allow tree block to be read if it crosses 64K(*) boundary. The 64K is selected because: - we are only going to support 64K page size for subpage for now - 64K is also the maximum supported node size This ensures that tree blocks are always contained in one page for a system with 64K page size, which can greatly simplify the handling. Otherwise we would have to do complex multi-page handling of tree blocks. Currently there is no way to create such tree blocks. In kernel we have avoided such tree blocks allocation even on 4K page size, as it can lead to RAID56 stripe scrubbing. While btrfs-progs have fixed its chunk allocator since 2016 for convert, and has extra checks to do the same behavior as the kernel. Just add such graceful checks in case of an ancient filesystem. Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
|
|
|
deb6789553 |
btrfs: calculate inline extent buffer page size based on page size
Btrfs only support 64K as maximum node size, thus for 4K page system, we would have at most 16 pages for one extent buffer. For a system using 64K page size, we would really have just one page. While we always use 16 pages for extent_buffer::pages, this means for systems using 64K pages, we are wasting memory for 15 page pointers which will never be used. Calculate the array size based on page size and the node size maximum. - for systems using 4K page size, it will stay 16 pages - for systems using 64K page size, it will be 1 page Move the definition of BTRFS_MAX_METADATA_BLOCKSIZE to btrfs_tree.h, to avoid circular inclusion of ctree.h. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
|
|
|
f91e0d0c4c |
btrfs: factor out btree page submission code to a helper
In btree_write_cache_pages() we have a btree page submission routine buried deeply in a nested loop. This patch will extract that part of code into a helper function, submit_eb_page(), to do the same work. Since submit_eb_page() now can return >0 for successful extent buffer submission, remove the "ASSERT(ret <= 0);" line. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
|
|
|
f44cf41075 |
btrfs: make btrfs_verify_data_csum follow sector size
Currently btrfs_verify_data_csum() just passes the whole page to check_data_csum(), which is fine since we only support sectorsize == PAGE_SIZE. To support subpage, we need to properly honor per-sector checksum verification, just like what we did in dio read path. This patch will do the csum verification in a for loop, starts with pg_off == start - page_offset(page), with sectorsize increase for each loop. For sectorsize == PAGE_SIZE case, the pg_off will always be 0, and we will only loop once. For subpage case, we do the iterate over each sector and if we found any error, we return error. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
|
|
|
7ffd27e378 |
btrfs: pass bio_offset to check_data_csum() directly
Parameter icsum for check_data_csum() is a little hard to understand. So is the phy_offset for btrfs_verify_data_csum(). Both parameters are calculated values for csum lookup. Instead of some calculated value, just pass bio_offset and let the final and only user, check_data_csum(), calculate whatever it needs. Since we are here, also make the bio_offset parameter and some related variables to be u32 (unsigned int). As bio size is limited by its bi_size, which is unsigned int, and has extra size limit check during various bio operations. Thus we are ensured that bio_offset won't overflow u32. Thus for all involved functions, not only rename the parameter from @phy_offset to @bio_offset, but also reduce its width to u32, so we won't have suspicious "u32 = u64 >> sector_bits;" lines anymore. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
|
|
|
1941b64b08 |
btrfs: rename bio_offset of extent_submit_bio_start_t to dio_file_offset
The parameter bio_offset of extent_submit_bio_start_t is very confusing. If it's really bio_offset (offset to bio), then it should be u32. But in fact, it's only utilized by dio read, and that member is used as file offset, which must be u64. Rename it to dio_file_offset since the only user uses it as file offset, and add comment for who is using it. Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
|
|
|
8a6a87cd44 |
btrfs: fix lockdep warning when creating free space tree
A lock dependency loop exists between the root tree lock, the extent tree lock, and the free space tree lock. The root tree lock depends on the free space tree lock because btrfs_create_tree holds the new tree's lock while adding it to the root tree. The extent tree lock depends on the root tree lock because during umount, we write out space cache v1, which writes inodes in the root tree, which results in holding the root tree lock while doing a lookup in the extent tree. Finally, the free space tree depends on the extent tree because populate_free_space_tree holds a locked path in the extent tree and then does a lookup in the free space tree to add the new item. The simplest of the three to break is the one during tree creation: we unlock the leaf before inserting the tree node into the root tree, which fixes the lockdep warning. [30.480136] ====================================================== [30.480830] WARNING: possible circular locking dependency detected [30.481457] 5.9.0-rc8+ #76 Not tainted [30.481897] ------------------------------------------------------ [30.482500] mount/520 is trying to acquire lock: [30.483064] ffff9babebe03908 (btrfs-free-space-00){++++}-{3:3}, at: __btrfs_tree_read_lock+0x39/0x180 [30.484054] but task is already holding lock: [30.484637] ffff9babebe24468 (btrfs-extent-01#2){++++}-{3:3}, at: __btrfs_tree_read_lock+0x39/0x180 [30.485581] which lock already depends on the new lock. [30.486397] the existing dependency chain (in reverse order) is: [30.487205] -> #2 (btrfs-extent-01#2){++++}-{3:3}: [30.487825] down_read_nested+0x43/0x150 [30.488306] __btrfs_tree_read_lock+0x39/0x180 [30.488868] __btrfs_read_lock_root_node+0x3a/0x50 [30.489477] btrfs_search_slot+0x464/0x9b0 [30.490009] check_committed_ref+0x59/0x1d0 [30.490603] btrfs_cross_ref_exist+0x65/0xb0 [30.491108] run_delalloc_nocow+0x405/0x930 [30.491651] btrfs_run_delalloc_range+0x60/0x6b0 [30.492203] writepage_delalloc+0xd4/0x150 [30.492688] __extent_writepage+0x18d/0x3a0 [30.493199] extent_write_cache_pages+0x2af/0x450 [30.493743] extent_writepages+0x34/0x70 [30.494231] do_writepages+0x31/0xd0 [30.494642] __filemap_fdatawrite_range+0xad/0xe0 [30.495194] btrfs_fdatawrite_range+0x1b/0x50 [30.495677] __btrfs_write_out_cache+0x40d/0x460 [30.496227] btrfs_write_out_cache+0x8b/0x110 [30.496716] btrfs_start_dirty_block_groups+0x211/0x4e0 [30.497317] btrfs_commit_transaction+0xc0/0xba0 [30.497861] sync_filesystem+0x71/0x90 [30.498303] btrfs_remount+0x81/0x433 [30.498767] reconfigure_super+0x9f/0x210 [30.499261] path_mount+0x9d1/0xa30 [30.499722] do_mount+0x55/0x70 [30.500158] __x64_sys_mount+0xc4/0xe0 [30.500616] do_syscall_64+0x33/0x40 [30.501091] entry_SYSCALL_64_after_hwframe+0x44/0xa9 [30.501629] -> #1 (btrfs-root-00){++++}-{3:3}: [30.502241] down_read_nested+0x43/0x150 [30.502727] __btrfs_tree_read_lock+0x39/0x180 [30.503291] __btrfs_read_lock_root_node+0x3a/0x50 [30.503903] btrfs_search_slot+0x464/0x9b0 [30.504405] btrfs_insert_empty_items+0x60/0xa0 [30.504973] btrfs_insert_item+0x60/0xd0 [30.505412] btrfs_create_tree+0x1b6/0x210 [30.505913] btrfs_create_free_space_tree+0x54/0x110 [30.506460] btrfs_mount_rw+0x15d/0x20f [30.506937] btrfs_remount+0x356/0x433 [30.507369] reconfigure_super+0x9f/0x210 [30.507868] path_mount+0x9d1/0xa30 [30.508264] do_mount+0x55/0x70 [30.508668] __x64_sys_mount+0xc4/0xe0 [30.509186] do_syscall_64+0x33/0x40 [30.509652] entry_SYSCALL_64_after_hwframe+0x44/0xa9 [30.510271] -> #0 (btrfs-free-space-00){++++}-{3:3}: [30.510972] __lock_acquire+0x11ad/0x1b60 [30.511432] lock_acquire+0xa2/0x360 [30.511917] down_read_nested+0x43/0x150 [30.512383] __btrfs_tree_read_lock+0x39/0x180 [30.512947] __btrfs_read_lock_root_node+0x3a/0x50 [30.513455] btrfs_search_slot+0x464/0x9b0 [30.513947] search_free_space_info+0x45/0x90 [30.514465] __add_to_free_space_tree+0x92/0x39d [30.515010] btrfs_create_free_space_tree.cold.22+0x1ee/0x45d [30.515639] btrfs_mount_rw+0x15d/0x20f [30.516142] btrfs_remount+0x356/0x433 [30.516538] reconfigure_super+0x9f/0x210 [30.517065] path_mount+0x9d1/0xa30 [30.517438] do_mount+0x55/0x70 [30.517824] __x64_sys_mount+0xc4/0xe0 [30.518293] do_syscall_64+0x33/0x40 [30.518776] entry_SYSCALL_64_after_hwframe+0x44/0xa9 [30.519335] other info that might help us debug this: [30.520210] Chain exists of: btrfs-free-space-00 --> btrfs-root-00 --> btrfs-extent-01#2 [30.521407] Possible unsafe locking scenario: [30.522037] CPU0 CPU1 [30.522456] ---- ---- [30.522941] lock(btrfs-extent-01#2); [30.523311] lock(btrfs-root-00); [30.523952] lock(btrfs-extent-01#2); [30.524620] lock(btrfs-free-space-00); [30.525068] *** DEADLOCK *** [30.525669] 5 locks held by mount/520: [30.526116] #0: ffff9babebc520e0 (&type->s_umount_key#37){+.+.}-{3:3}, at: path_mount+0x7ef/0xa30 [30.527056] #1: ffff9babebc52640 (sb_internal#2){.+.+}-{0:0}, at: start_transaction+0x3d5/0x5c0 [30.527960] #2: ffff9babeae8f2e8 (&cache->free_space_lock#2){+.+.}-{3:3}, at: btrfs_create_free_space_tree.cold.22+0x101/0x45d [30.529118] #3: ffff9babebe24468 (btrfs-extent-01#2){++++}-{3:3}, at: __btrfs_tree_read_lock+0x39/0x180 [30.530113] #4: ffff9babebd52eb8 (btrfs-extent-00){++++}-{3:3}, at: btrfs_try_tree_read_lock+0x16/0x100 [30.531124] stack backtrace: [30.531528] CPU: 0 PID: 520 Comm: mount Not tainted 5.9.0-rc8+ #76 [30.532166] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.11.1-4.module_el8.1.0+248+298dec18 04/01/2014 [30.533215] Call Trace: [30.533452] dump_stack+0x8d/0xc0 [30.533797] check_noncircular+0x13c/0x150 [30.534233] __lock_acquire+0x11ad/0x1b60 [30.534667] lock_acquire+0xa2/0x360 [30.535063] ? __btrfs_tree_read_lock+0x39/0x180 [30.535525] down_read_nested+0x43/0x150 [30.535939] ? __btrfs_tree_read_lock+0x39/0x180 [30.536400] __btrfs_tree_read_lock+0x39/0x180 [30.536862] __btrfs_read_lock_root_node+0x3a/0x50 [30.537304] btrfs_search_slot+0x464/0x9b0 [30.537713] ? trace_hardirqs_on+0x1c/0xf0 [30.538148] search_free_space_info+0x45/0x90 [30.538572] __add_to_free_space_tree+0x92/0x39d [30.539071] ? printk+0x48/0x4a [30.539367] btrfs_create_free_space_tree.cold.22+0x1ee/0x45d [30.539972] btrfs_mount_rw+0x15d/0x20f [30.540350] btrfs_remount+0x356/0x433 [30.540773] ? shrink_dcache_sb+0xd9/0x100 [30.541203] reconfigure_super+0x9f/0x210 [30.541642] path_mount+0x9d1/0xa30 [30.542040] do_mount+0x55/0x70 [30.542366] __x64_sys_mount+0xc4/0xe0 [30.542822] do_syscall_64+0x33/0x40 [30.543197] entry_SYSCALL_64_after_hwframe+0x44/0xa9 [30.543691] RIP: 0033:0x7f109f7ab93a [30.546042] RSP: 002b:00007ffc47c4f858 EFLAGS: 00000246 ORIG_RAX: 00000000000000a5 [30.546770] RAX: ffffffffffffffda RBX: 00007f109f8cf264 RCX: 00007f109f7ab93a [30.547485] RDX: 0000557e6fc10770 RSI: 0000557e6fc19cf0 RDI: 0000557e6fc19cd0 [30.548185] RBP: 0000557e6fc10520 R08: 0000557e6fc18e30 R09: 0000557e6fc18cb0 [30.548911] R10: 0000000000200020 R11: 0000000000000246 R12: 0000000000000000 [30.549606] R13: 0000557e6fc19cd0 R14: 0000557e6fc10770 R15: 0000557e6fc10520 Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Boris Burkov <boris@bur.io> Signed-off-by: David Sterba <dsterba@suse.com> |
|
|
|
af456a2c0a |
btrfs: skip space_cache v1 setup when not using it
If we are not using space cache v1, we should not create the free space object or free space inodes. This comes up when we delete the existing free space objects/inodes when migrating to v2, only to see them get recreated for every dirtied block group. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Boris Burkov <boris@bur.io> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
|
|
|
36b216c85e |
btrfs: remove free space items when disabling space cache v1
When the filesystem transitions from space cache v1 to v2 or to nospace_cache, it removes the old cached data, but does not remove the FREE_SPACE items nor the free space inodes they point to. This doesn't cause any issues besides being a bit inefficient, since these items no longer do anything useful. To fix it, when we are mounting, and plan to disable the space cache, destroy each block group's free space item and free space inode. The code to remove the items is lifted from the existing use case of removing the block group, with a light adaptation to handle whether or not we have already looked up the free space inode. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Boris Burkov <boris@bur.io> Signed-off-by: David Sterba <dsterba@suse.com> |
|
|
|
2838d255cb |
btrfs: warn when remount will not change the free space tree
If the remount is ro->ro, rw->ro, or rw->rw, we will not create or clear the free space tree. This can be surprising, so print a warning to dmesg to make the failure more visible. It is also important to ensure that the space cache options (SPACE_CACHE, FREE_SPACE_TREE) are consistent, so ensure those are set to properly match the current on disk state (which won't be changing). Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Boris Burkov <boris@bur.io> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
|
|
|
04c4155969 |
btrfs: use superblock state to print space_cache mount option
To make the contents of /proc/mounts better match the actual state of the filesystem, base the display of the space cache mount options off the contents of the super block rather than the last mount options passed in. Since there are many scenarios where the mount will ignore a space cache option, simply showing the passed in option is misleading. For example, if we mount with -o remount,space_cache=v2 on a read-write file system without an existing free space tree, we won't build a free space tree, but /proc/mounts will read space_cache=v2 (until we mount again and it goes away) cache_generation is set iff space_cache=v1, FREE_SPACE_TREE is set iff space_cache=v2, and if neither is the case, we print nospace_cache. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Boris Burkov <boris@bur.io> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
|
|
|
9484622945 |
btrfs: keep sb cache_generation consistent with space_cache
When mounting, btrfs uses the cache_generation in the super block to determine if space cache v1 is in use. However, by mounting with nospace_cache or space_cache=v2, it is possible to disable space cache v1, which does not result in un-setting cache_generation back to 0. In order to base some logic, like mount option printing in /proc/mounts, on the current state of the space cache rather than just the values of the mount option, keep the value of cache_generation consistent with the status of space cache v1. We ensure that cache_generation > 0 iff the file system is using space_cache v1. This requires committing a transaction on any mount which changes whether we are using v1. (v1->nospace_cache, v1->v2, nospace_cache->v1, v2->v1). Since the mechanism for writing out the cache generation is transaction commit, but we want some finer grained control over when we un-set it, we can't just rely on the SPACE_CACHE mount option, and introduce an fs_info flag that mount can use when it wants to unset the generation. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Boris Burkov <boris@bur.io> Signed-off-by: David Sterba <dsterba@suse.com> |
|
|
|
8b228324a8 |
btrfs: clear free space tree on ro->rw remount
A user might want to revert to v1 or nospace_cache on a root filesystem, and much like turning on the free space tree, that can only be done remounting from ro->rw. Support clearing the free space tree on such mounts by moving it into the shared remount logic. Since the CLEAR_CACHE option sticks around across remounts, this change would result in clearing the tree for ever on every remount, which is not desirable. To fix that, add CLEAR_CACHE to the oneshot options we clear at mount end, which has the other bonus of not cluttering the /proc/mounts output with clear_cache. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Boris Burkov <boris@bur.io> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
|
|
|
8cd2908846 |
btrfs: clear oneshot options on mount and remount
Some options only apply during mount time and are cleared at the end of mount. For now, the example is USEBACKUPROOT, but CLEAR_CACHE also fits the bill, and this is a preparation patch for also clearing that option. One subtlety is that the current code only resets USEBACKUPROOT on rw mounts, but the option is meaningfully "consumed" by a ro mount, so it feels appropriate to clear in that case as well. A subsequent read-write remount would not go through open_ctree, which is the only place that checks the option, so the change should be benign. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Boris Burkov <boris@bur.io> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
|
|
|
5011139a47 |
btrfs: create free space tree on ro->rw remount
When a user attempts to remount a btrfs filesystem with 'mount -o remount,space_cache=v2', that operation silently succeeds. Unfortunately, this is misleading, because the remount does not create the free space tree. /proc/mounts will incorrectly show space_cache=v2, but on the next mount, the file system will revert to the old space_cache. For now, we handle only the easier case, where the existing mount is read-only and the new mount is read-write. In that case, we can create the free space tree without contending with the block groups changing as we go. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Boris Burkov <boris@bur.io> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
|
|
|
997e3e2e71 |
btrfs: only mark bg->needs_free_space if free space tree is on
If we attempt to create a free space tree while any block groups have needs_free_space set, we will double add the new free space item and hit EEXIST. Previously, we only created the free space tree on a new mount, so we never hit the case, but if we try to create it on a remount, such block groups could exist and trip us up. We don't do anything with this field unless the free space tree is enabled, so there is no harm in not setting it. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Boris Burkov <boris@bur.io> Signed-off-by: David Sterba <dsterba@suse.com> |
|
|
|
8f1c21d749 |
btrfs: start orphan cleanup on ro->rw remount
When we mount a rw filesystem, we start the orphan cleanup process in tree root and filesystem tree. However, when we remount a ro file system rw, we only clean the former. Move the calls to btrfs_orphan_cleanup() on tree_root and fs_root to the shared rw mount routine to effectively add them on ro->rw remount. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Boris Burkov <boris@bur.io> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
|
|
|
44c0ca211a |
btrfs: lift read-write mount setup from mount and remount
Mounting rw and remounting from ro to rw naturally share invariants and functionality which result in a correctly setup rw filesystem. Luckily, there is even a strong unity in the code which implements them. In mount's open_ctree, these operations mostly happen after an early return for ro file systems, and in remount, they happen in a section devoted to remounting ro->rw, after some remount specific validation passes. However, there are unfortunately a few differences. There are small deviations in the order of some of the operations, remount does not start orphan cleanup in root_tree or fs_tree, remount does not create the free space tree, and remount does not handle "one-shot" mount options like clear_cache and uuid tree rescan. Since we want to add building the free space tree to remount, and also to start the same orphan cleanup process on a filesystem mounted as ro then remounted rw, we would benefit from unifying the logic between the two code paths. This patch only lifts the existing common functionality, and leaves a natural path for fixing the discrepancies. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Boris Burkov <boris@bur.io> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
|
|
|
47876f7cef |
btrfs: do not block inode logging for so long during transaction commit
Early on during a transaction commit we acquire the tree_log_mutex and
hold it until after we write the super blocks. But before writing the
extent buffers dirtied by the transaction and the super blocks we unblock
the transaction by setting its state to TRANS_STATE_UNBLOCKED and setting
fs_info->running_transaction to NULL.
This means that after that and before writing the super blocks, new
transactions can start. However if any transaction wants to log an inode,
it will block waiting for the transaction commit to write its dirty
extent buffers and the super blocks because the tree_log_mutex is only
released after those operations are complete, and starting a new log
transaction blocks on that mutex (at start_log_trans()).
Writing the dirty extent buffers and the super blocks can take a very
significant amount of time to complete, but we could allow the tasks
wanting to log an inode to proceed with most of their steps:
1) create the log trees
2) log metadata in the trees
3) write their dirty extent buffers
They only need to wait for the previous transaction commit to complete
(write its super blocks) before they attempt to write their super blocks,
otherwise we could end up with a corrupt filesystem after a crash.
So change start_log_trans() to use the root tree's log_mutex to serialize
for the creation of the log root tree instead of using the tree_log_mutex,
and make btrfs_sync_log() acquire the tree_log_mutex before writing the
super blocks. This allows for inode logging to wait much less time when
there is a previous transaction that is still committing, often not having
to wait at all, as by the time when we try to sync the log the previous
transaction already wrote its super blocks.
This patch belongs to a patch set that is comprised of the following
patches:
btrfs: fix race causing unnecessary inode logging during link and rename
btrfs: fix race that results in logging old extents during a fast fsync
btrfs: fix race that causes unnecessary logging of ancestor inodes
btrfs: fix race that makes inode logging fallback to transaction commit
btrfs: fix race leading to unnecessary transaction commit when logging inode
btrfs: do not block inode logging for so long during transaction commit
The following script that uses dbench was used to measure the impact of
the whole patchset:
$ cat test-dbench.sh
#!/bin/bash
DEV=/dev/nvme0n1
MNT=/mnt/btrfs
MOUNT_OPTIONS="-o ssd"
echo "performance" | \
tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor
mkfs.btrfs -f -m single -d single $DEV
mount $MOUNT_OPTIONS $DEV $MNT
dbench -D $MNT -t 300 64
umount $MNT
The test was run on a machine with 12 cores, 64G of ram, using a NVMe
device and a non-debug kernel configuration (Debian's default).
Before patch set:
Operation Count AvgLat MaxLat
----------------------------------------
NTCreateX 11277211 0.250 85.340
Close 8283172 0.002 6.479
Rename 477515 1.935 86.026
Unlink 2277936 0.770 87.071
Deltree 256 15.732 81.379
Mkdir 128 0.003 0.009
Qpathinfo 10221180 0.056 44.404
Qfileinfo 1789967 0.002 4.066
Qfsinfo 1874399 0.003 9.176
Sfileinfo 918589 0.061 10.247
Find 3951758 0.341 54.040
WriteX 5616547 0.047 85.079
ReadX 17676028 0.005 9.704
LockX 36704 0.003 1.800
UnlockX 36704 0.002 0.687
Flush 790541 14.115 676.236
Throughput 1179.19 MB/sec 64 clients 64 procs max_latency=676.240 ms
After patch set:
Operation Count AvgLat MaxLat
----------------------------------------
NTCreateX 12687926 0.171 86.526
Close 9320780 0.002 8.063
Rename 537253 1.444 78.576
Unlink 2561827 0.559 87.228
Deltree 374 11.499 73.549
Mkdir 187 0.003 0.005
Qpathinfo 11500300 0.061 36.801
Qfileinfo 2017118 0.002 7.189
Qfsinfo 2108641 0.003 4.825
Sfileinfo 1033574 0.008 8.065
Find 4446553 0.408 47.835
WriteX 6335667 0.045 84.388
ReadX 19887312 0.003 9.215
LockX 41312 0.003 1.394
UnlockX 41312 0.002 1.425
Flush 889233 13.014 623.259
Throughput 1339.32 MB/sec 64 clients 64 procs max_latency=623.265 ms
+12.7% throughput, -8.2% max latency
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
|
|
639bd575b7 |
btrfs: fix race leading to unnecessary transaction commit when logging inode
When logging an inode we may often have to fallback to a full transaction
commit, either because a new block group was allocated, there is some case
we can not deal with without a transaction commit or some error like an
ENOMEM happened. However after we fallback to a transaction commit, we
have a time window where we can make the next attempt to log any inode
commit the next transaction unnecessarily, adding additional overhead and
increasing latency.
A sequence of steps that leads to this issue is the following:
1) The current open transaction has a generation of 1000;
2) A new block group is allocated, and as a consequence we must make sure
any attempts to commit a log fallback to a transaction commit, so
btrfs_set_log_full_commit() is called from btrfs_make_block_group().
This sets fs_info->last_trans_log_full_commit to 1000;
3) Task A is holding a handle on transaction 1000 and tries to log inode X.
Once it gets to start_log_trans(), it calls btrfs_need_log_full_commit()
which returns true, since fs_info->last_trans_log_full_commit has a
value of 1000. So we end up returning EAGAIN and propagating it up to
btrfs_sync_file(), where we commit transaction 1000;
4) The transaction commit task (task A) sets the transaction state to
unblocked (TRANS_STATE_UNBLOCKED);
5) Some other task, task B, starts a new transaction with a generation of
1001;
6) Some stuff is done with transaction 1001, some btree blocks COWed, etc;
7) Transaction 1000 has not fully committed yet, we are still writing all
the extent buffers it created;
8) Some new task, task C, starts an fsync of inode Y, gets a handle for
transaction 1001, and it gets to btrfs_log_inode_parent() which does
the following check:
if (fs_info->last_trans_log_full_commit > last_committed) {
ret = 1;
goto end_no_trans;
}
At that point last_trans_log_full_commit has a value of 1000 and
last_committed (value of fs_info->last_trans_committed) has a value of
999, since transaction 1000 has not yet committed - it is either still
writing out dirty extent buffers, its super blocks or unpinning
extents.
As a consequence we return 1, which gets propagated up to
btrfs_sync_file(), which will then call btrfs_commit_transaction()
for transaction 1001.
As a consequence we have an unnecessary second transaction commit, we
previously committed transaction 1000 and now commit transaction 1001
as well, resulting in more overhead and increased latency.
So fix this double transaction commit issue simply by removing that check,
because all we need to do is wait for the previous transaction to finish
its commit, which we already do later when starting the log transaction at
start_log_trans(), because there we acquire the tree_log_mutex lock, which
is held by a transaction commit and only released after the transaction
commits its super blocks.
Another issue that check has is that it reads last_trans_log_full_commit
without using READ_ONCE(), which is incorrect since that member of
struct btrfs_fs_info is always updated with WRITE_ONCE() through the
helper btrfs_set_log_full_commit().
This double transaction commit issue can actually be triggered quite often
in long runs of dbench, since besides the creation of new block groups
that force inode logging to fallback to a transaction commit, there are
cases where dbench asks to fsync a directory which had files in it that
were previously renamed or subdirectories that were removed, resulting in
the inode logging to fallback to a full transaction commit.
This patch belongs to a patch set that is comprised of the following
patches:
btrfs: fix race causing unnecessary inode logging during link and rename
btrfs: fix race that results in logging old extents during a fast fsync
btrfs: fix race that causes unnecessary logging of ancestor inodes
btrfs: fix race that makes inode logging fallback to transaction commit
btrfs: fix race leading to unnecessary transaction commit when logging inode
btrfs: do not block inode logging for so long during transaction commit
Performance results are mentioned in the change log of the last patch.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
|
|
47d3db41e1 |
btrfs: fix race that makes inode logging fallback to transaction commit
When logging an inode and the previous transaction is still committing, we
have a time window where we can end up incorrectly think an inode has its
last_unlink_trans field with a value greater than the last transaction
committed, which results in the logging to fallback to a full transaction
commit, which is usually much more expensive than doing a log commit.
The race is described by the following steps:
1) We are at transaction 1000;
2) We modify an inode X (a directory) using transaction 1000 and set its
last_unlink_trans field to 1000, because for example we removed one
of its subdirectories;
3) We create a new inode Y with a dentry in inode X using transaction 1000,
so its generation field is set to 1000;
4) The commit for transaction 1000 is started by task A;
5) The task committing transaction 1000 sets the transaction state to
unblocked, writes the dirty extent buffers and the super blocks, then
unlocks tree_log_mutex;
6) Some task starts a new transaction with a generation of 1001;
7) We do some modification to inode Y (using transaction 1001);
8) The transaction 1000 commit starts unpinning extents. At this point
fs_info->last_trans_committed still has a value of 999;
9) Task B starts an fsync on inode Y, and gets a handle for transaction
1001. When it gets to check_parent_dirs_for_sync() it does the checking
of the ancestor dentries because the following check does not evaluate
to true:
if (S_ISREG(inode->vfs_inode.i_mode) &&
inode->generation <= last_committed &&
inode->last_unlink_trans <= last_committed)
goto out;
The generation value for inode Y is 1000 and last_committed, which has
the value read from fs_info->last_trans_committed, has a value of 999,
so that check evaluates to false and we proceed to check the ancestor
inodes.
Once we get to the first ancestor, inode X, we call
btrfs_must_commit_transaction() on it, which evaluates to true:
static bool btrfs_must_commit_transaction(...)
{
struct btrfs_fs_info *fs_info = inode->root->fs_info;
bool ret = false;
mutex_lock(&inode->log_mutex);
if (inode->last_unlink_trans > fs_info->last_trans_committed) {
/*
* Make sure any commits to the log are forced to be full
* commits.
*/
btrfs_set_log_full_commit(trans);
ret = true;
}
(...)
because inode's X last_unlink_trans has a value of 1000 and
fs_info->last_trans_committed still has a value of 999, it returns
true to check_parent_dirs_for_sync(), making it return 1 which is
propagated up to btrfs_sync_file(), causing it to fallback to a full
transaction commit of transaction 1001.
We should have not fallen back to commit transaction 1001, since inode
X had last_unlink_trans set to 1000 and the super blocks for
transaction 1000 were already written. So while not resulting in a
functional problem, it leads to a lot more work and higher latencies
for a fsync since committing a transaction is usually more expensive
than committing a log (if other filesystem changes happened under that
transaction).
Similar problem happens when logging directories, for the same reason as
btrfs_must_commit_transaction() returns true on an inode with its
last_unlink_trans having the generation of the previous transaction and
that transaction is still committing, unpinning its freed extents.
So fix this by comparing last_unlink_trans with the id of the current
transaction instead of fs_info->last_trans_committed.
This case is often hit when running dbench for a long enough duration, as
it does lots of rename and rmdir operations (both update the field
last_unlink_trans of an inode) and fsyncs of files and directories.
This patch belongs to a patch set that is comprised of the following
patches:
btrfs: fix race causing unnecessary inode logging during link and rename
btrfs: fix race that results in logging old extents during a fast fsync
btrfs: fix race that causes unnecessary logging of ancestor inodes
btrfs: fix race that makes inode logging fallback to transaction commit
btrfs: fix race leading to unnecessary transaction commit when logging inode
btrfs: do not block inode logging for so long during transaction commit
Performance results are mentioned in the change log of the last patch.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
|
|
4d6221d7d8 |
btrfs: fix race that causes unnecessary logging of ancestor inodes
When logging an inode and we are checking if we need to log ancestors that
are new, if the previous transaction is still committing we have a time
window where we can unnecessarily log ancestor inodes that were created in
the previous transaction.
The race is described by the following steps:
1) We are at transaction 1000;
2) Directory inode X is created, its generation is set to 1000;
3) The commit for transaction 1000 is started by task A;
4) The task committing transaction 1000 sets the transaction state to
unblocked, writes the dirty extent buffers and the super blocks, then
unlocks tree_log_mutex;
5) Inode Y, a regular file, is created under directory inode X, this
results in starting a new transaction with a generation of 1001;
6) The transaction 1000 commit is unpinning extents. At this point
fs_info->last_trans_committed still has a value of 999;
7) Task B calls fsync on inode Y and gets a handle for transaction 1001;
8) Task B ends up at log_all_new_ancestors() and then because inode Y has
only one hard link, ends up at log_new_ancestors_fast(). There it reads
a value of 999 from fs_info->last_trans_committed, and sees that the
parent inode X has a generation of 1000, so we end up logging inode X:
if (inode->generation > fs_info->last_trans_committed) {
ret = btrfs_log_inode(trans, root, inode,
LOG_INODE_EXISTS, ctx);
(...)
which is not necessary since it was created in the past transaction,
with a generation of 1000, and that transaction has already committed
its super blocks - it's still unpinning extents so it has not yet
updated fs_info->last_trans_committed from 999 to 1000.
So this just causes us to spend more time logging and allocating and
writing more tree blocks for the log tree.
So fix this by comparing an inode's generation with the generation of the
transaction our transaction handle refers to - if the inode's generation
matches the generation of the current transaction than we know it is a
new inode we need to log, otherwise don't log it.
This case is often hit when running dbench for a long enough duration.
This patch belongs to a patch set that is comprised of the following
patches:
btrfs: fix race causing unnecessary inode logging during link and rename
btrfs: fix race that results in logging old extents during a fast fsync
btrfs: fix race that causes unnecessary logging of ancestor inodes
btrfs: fix race that makes inode logging fallback to transaction commit
btrfs: fix race leading to unnecessary transaction commit when logging inode
btrfs: do not block inode logging for so long during transaction commit
Performance results are mentioned in the change log of the last patch.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
|
|
5f96bfb763 |
btrfs: fix race that results in logging old extents during a fast fsync
When logging the extents of an inode during a fast fsync, we have a time window where we can log extents that are from the previous transaction and already persisted. This only makes us waste time unnecessarily. The following sequence of steps shows how this can happen: 1) We are at transaction 1000; 2) An ordered extent E from inode I completes, that is it has gone through btrfs_finish_ordered_io(), and it set the extent maps' generation to 1000 when we unpin the extent, which is the generation of the current transaction; 3) The commit for transaction 1000 starts by task A; 4) The task committing transaction 1000 sets the transaction state to unblocked, writes the dirty extent buffers and the super blocks, then unlocks tree_log_mutex; 5) Some change is made to inode I, resulting in creation of a new transaction with a generation of 1001; 6) The transaction 1000 commit starts unpinning extents. At this point fs_info->last_trans_committed still has a value of 999; 7) Task B starts an fsync on inode I, and when it gets to btrfs_log_changed_extents() sees the extent map for extent E in the list of modified extents. It sees the extent map has a generation of 1000 and fs_info->last_trans_committed has a value of 999, so it proceeds to logging the respective file extent item and all the checksums covering its range. So we end up wasting time since the extent was already persisted and is reachable through the trees pointed to by the super block committed by transaction 1000. So just fix this by comparing the extent maps generation against the generation of the transaction handle - if it is smaller then the id in the handle, we know the extent was already persisted and we do not need to log it. This patch belongs to a patch set that is comprised of the following patches: btrfs: fix race causing unnecessary inode logging during link and rename btrfs: fix race that results in logging old extents during a fast fsync btrfs: fix race that causes unnecessary logging of ancestor inodes btrfs: fix race that makes inode logging fallback to transaction commit btrfs: fix race leading to unnecessary transaction commit when logging inode btrfs: do not block inode logging for so long during transaction commit Performance results are mentioned in the change log of the last patch. Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
|
|
|
de53d892e5 |
btrfs: fix race causing unnecessary inode logging during link and rename
When we are doing a rename or a link operation for an inode that was logged in the previous transaction and that transaction is still committing, we have a time window where we incorrectly consider that the inode was logged previously in the current transaction and therefore decide to log it to update it in the log. The following steps give an example on how this happens during a link operation: 1) Inode X is logged in transaction 1000, so its logged_trans field is set to 1000; 2) Task A starts to commit transaction 1000; 3) The state of transaction 1000 is changed to TRANS_STATE_UNBLOCKED; 4) Task B starts a link operation for inode X, and as a consequence it starts transaction 1001; 5) Task A is still committing transaction 1000, therefore the value stored at fs_info->last_trans_committed is still 999; 6) Task B calls btrfs_log_new_name(), it reads a value of 999 from fs_info->last_trans_committed and because the logged_trans field of inode X has a value of 1000, the function does not return immediately, instead it proceeds to logging the inode, which should not happen because the inode was logged in the previous transaction (1000) and not in the current one (1001). This is not a functional problem, just wasted time and space logging an inode that does not need to be logged, contributing to higher latency for link and rename operations. So fix this by comparing the inodes' logged_trans field with the generation of the current transaction instead of comparing with the value stored in fs_info->last_trans_committed. This case is often hit when running dbench for a long enough duration, as it does lots of rename operations. This patch belongs to a patch set that is comprised of the following patches: btrfs: fix race causing unnecessary inode logging during link and rename btrfs: fix race that results in logging old extents during a fast fsync btrfs: fix race that causes unnecessary logging of ancestor inodes btrfs: fix race that makes inode logging fallback to transaction commit btrfs: fix race leading to unnecessary transaction commit when logging inode btrfs: do not block inode logging for so long during transaction commit Performance results are mentioned in the change log of the last patch. Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
|
|
|
fa598b0696 |
btrfs: remove recalc_thresholds from free space ops
After removing the inode number cache that was using the free space cache code, we can remove at least the recalc_thresholds callback from the ops. Both code and tests use the same callback function. It's moved before its first use. The use_bitmaps callback is still needed by tests to create some extents/bitmap setup. Signed-off-by: David Sterba <dsterba@suse.com> |
|
|
|
f0d1219def |
btrfs: always set NODATASUM/NODATACOW in __create_free_space_inode
Since it's being used solely for the freespace cache unconditionally set the flags required for it. Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
|
|
|
7dbdb443a7 |
btrfs: remove crc_check logic from free space
Following removal of the ino cache io_ctl_init will be called only on behalf of the freespace inode. In this case we always want to check CRCs so conditional code that depended on io_ctl::check_crc can be removed. Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
|
|
|
5297199a8b |
btrfs: remove inode number cache feature
It's been deprecated since commit
|
|
|
|
abadc1fcd7 |
btrfs: replace calls to btrfs_find_free_ino with btrfs_find_free_objectid
The former is going away as part of the inode map removal so switch callers to btrfs_find_free_objectid. No functional changes since with INODE_MAP disabled (default) find_free_objectid was called anyway. Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
|
|
|
ec7d6dfd73 |
btrfs: move btrfs_find_highest_objectid/btrfs_find_free_objectid to disk-io.c
Those functions are going to be used even after inode cache is removed so moved them to a more appropriate place. Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
|
|
|
1201b58b67 |
btrfs: drop casts of bio bi_sector
Since commit
|
|
|
|
12659251ca |
btrfs: implement log-structured superblock for ZONED mode
Superblock (and its copies) is the only data structure in btrfs which has a fixed location on a device. Since we cannot overwrite in a sequential write required zone, we cannot place superblock in the zone. One easy solution is limiting superblock and copies to be placed only in conventional zones. However, this method has two downsides: one is reduced number of superblock copies. The location of the second copy of superblock is 256GB, which is in a sequential write required zone on typical devices in the market today. So, the number of superblock and copies is limited to be two. Second downside is that we cannot support devices which have no conventional zones at all. To solve these two problems, we employ superblock log writing. It uses two adjacent zones as a circular buffer to write updated superblocks. Once the first zone is filled up, start writing into the second one. Then, when both zones are filled up and before starting to write to the first zone again, it reset the first zone. We can determine the position of the latest superblock by reading write pointer information from a device. One corner case is when both zones are full. For this situation, we read out the last superblock of each zone, and compare them to determine which zone is older. The following zones are reserved as the circular buffer on ZONED btrfs. - The primary superblock: zones 0 and 1 - The first copy: zones 16 and 17 - The second copy: zones 1024 or zone at 256GB which is minimum, and next to it If these reserved zones are conventional, superblock is written fixed at the start of the zone without logging. Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
|
|
|
a589dde0bc |
btrfs: disallow mixed-bg in ZONED mode
Placing both data and metadata in a block group is impossible in ZONED mode. For data, we can allocate a space for it and write it immediately after the allocation. For metadata, however, we cannot do that, because the logical addresses are recorded in other metadata buffers to build up the trees. As a result, a data buffer can be placed after a metadata buffer, which is not written yet. Writing out the data buffer will break the sequential write rule. Check and disallow MIXED_BG with ZONED mode. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
|
|
|
f1569c4c10 |
btrfs: disable fallocate in ZONED mode
fallocate() is implemented by reserving actual extent instead of reservations. This can result in exposing the sequential write constraint of host-managed zoned block devices to the application, which would break the POSIX semantic for the fallocated file. To avoid this, report fallocate() as not supported when in ZONED mode for now. In the future, we may be able to implement "in-memory" fallocate() in ZONED mode by utilizing space_info->bytes_may_use or similar, so this returns EOPNOTSUPP. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
|
|
|
d206e9c9c5 |
btrfs: disallow NODATACOW in ZONED mode
NODATACOW implies overwriting the file data on a device, which is impossible in sequential required zones. Disable NODATACOW globally with mount option and per-file NODATACOW attribute by masking FS_NOCOW_FL. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
|
|
|
5d1ab66c56 |
btrfs: disallow space_cache in ZONED mode
As updates to the space cache v1 are in-place, the space cache cannot be located over sequential zones and there is no guarantees that the device will have enough conventional zones to store this cache. Resolve this problem by disabling completely the space cache v1. This does not introduce any problems with sequential block groups: all the free space is located after the allocation pointer and no free space before the pointer. There is no need to have such cache. Note: we can technically use free-space-tree (space cache v2) on ZONED mode. But, since ZONED mode now always allocates extents in a block group sequentially regardless of underlying device zone type, it's no use to enable and maintain the tree. For the same reason, NODATACOW is also disabled. In summary, ZONED will disable: | Disabled features | Reason | |-------------------+-----------------------------------------------------| | RAID/DUP | Cannot handle two zone append writes to different | | | zones | |-------------------+-----------------------------------------------------| | space_cache (v1) | In-place updating | | NODATACOW | In-place updating | |-------------------+-----------------------------------------------------| | fallocate | Reserved extent will be a write hole | |-------------------+-----------------------------------------------------| | MIXED_BG | Allocated metadata region will be write holes for | | | data writes | Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
|
|
|
862931c763 |
btrfs: introduce max_zone_append_size
The zone append write command has a maximum IO size restriction it accepts. This is because a zone append write command cannot be split, as we ask the device to place the data into a specific target zone and the device responds with the actual written location of the data. Introduce max_zone_append_size to zone_info and fs_info to track the value, so we can limit all I/O to a zoned block device that we want to write using the zone append command to the device's limits. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
|
|
|
b70f509774 |
btrfs: check and enable ZONED mode
Introduce function btrfs_check_zoned_mode() to check if ZONED flag is enabled on the file system and if the file system consists of zoned devices with equal zone size. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com> Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
|
|
|
5b31646898 |
btrfs: get zone information of zoned block devices
If a zoned block device is found, get its zone information (number of zones and zone size). To avoid costly run-time zone report commands to test the device zones type during block allocation, attach the seq_zones bitmap to the device structure to indicate if a zone is sequential or accept random writes. Also it attaches the empty_zones bitmap to indicate if a zone is empty or not. This patch also introduces the helper function btrfs_dev_is_sequential() to test if the zone storing a block is a sequential write required zone and btrfs_dev_is_empty_zone() to test if the zone is a empty zone. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com> Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
|
|
|
7b3d5a90cb |
btrfs: introduce ZONED feature flag
This patch introduces the ZONED incompat flag. The flag indicates that the volume management will satisfy the constraints imposed by host-managed zoned block devices (aligned chunk allocation, append-only updates, reset zone after filled). As the zoned support will happen incrementally due to enhancing some core infrastructure like super block writes, tree-log, raid support, the feature will appear in sysfs only on debug builds. It will be enabled once the support is feature complete and applications can reliably check whether zoned support is present or not. Reviewed-by: Anand Jain <anand.jain@oracle.com> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com> Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
|
|
|
a2633b6a29 |
btrfs: return bool from btrfs_should_end_transaction
Results in slightly smaller code. add/remove: 0/0 grow/shrink: 0/1 up/down: 0/-11 (-11) Function old new delta btrfs_should_end_transaction 96 85 -11 Total: Before=20070, After=20059, chg -0.05% Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
|
|
|
8a8f4deaba |
btrfs: return bool from should_end_transaction
Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
|
|
|
8df01fddb7 |
btrfs: remove err variable from do_relocation
It simply gets assigned to 'ret' in case of errors. The flow of the while loop is not changed by this commit since the few call sites that 'goto next' will simply break from the loop. Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
|
|
|
c6a592f2e2 |
btrfs: eliminate err variable from merge_reloc_root
In most cases when an error is returned from a function 'ret' is simply assigned to 'err'. There is only one case where walk_up_reloc_tree can return a positive value - in this case the code breaks from the loop and ret is going to get its return value from btrfs_cow_block - either 0 or negative. This retains the old logic of how 'err' used to be set at this call site. Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
|
|
|
ee0d904fd9 |
btrfs: remove err variable from btrfs_delete_subvolume
Use only a single 'ret' to control whether we should abort the transaction or not. That's fine, because if we abort a transaction then btrfs_end_transaction will return the same value as passed to btrfs_abort_transaction. No semantic changes. Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
|
|
|
c65ca98f9e |
btrfs: unlock path before checking if extent is shared during nocow writeback
When we are attempting to start writeback for an existing extent in NOCOW mode, at run_delalloc_nocow(), we must check if the extent is shared, and if it is, fallback to a COW write. However we do such check while still holding a read lock on the leaf that contains the file extent item, and that check, the call to btrfs_cross_ref_exist(), can take some time because: 1) It needs to do a search on the extent tree, which obviously takes some time, specially if delayed references are being run at the moment, as we can block when trying to lock currently write locked btree nodes; 2) It needs to check the delayed references for any existing reference for our data extent, this requires acquiring the delayed references' spinlock and maybe block on the mutex of a delayed reference head in the case where there is a delayed reference for our data extent, in the worst case it makes us release the path on the extent tree and retry the whole process again (going back to step 1). There are other operations we do while holding the leaf locked that can take some significant time as well (specially all together): * btrfs_extent_readonly() - to check if the block group containing the extent is currently in RO mode. This requires taking a spinlock and searching for the block group in a rbtree that can be big on large filesystems; * csum_exist_in_range() - to search if there are any checksums in the csum tree for the extent. Like before, this can take some time if we are in a filesystem that has both COW and NOCOW files, in which case the csum tree is not empty; * btrfs_inc_nocow_writers() - increment the number of nocow writers in the block group that contains the data extent. Needs to acquire a spinlock and search for the block group in a rbtree that can be big on large filesystems. So just unlock the leaf (release the path) before doing all those checks, since we do not need it anymore. In case we can not do a NOCOW write for the extent, due to any of those checks failing, and the writeback range goes beyond that extents' length, we will do another btree search for the next file extent item. The following script that calls dbench was used to measure the impact of this change on a VM with 8 CPUs, 16Gb of ram, using a raw NVMe device directly (no intermediary filesystem on the host) and using a non-debug kernel (default configuration on Debian): $ cat test-dbench.sh #!/bin/bash DEV=/dev/sdk MNT=/mnt/sdk MOUNT_OPTIONS="-o ssd -o nodatacow" MKFS_OPTIONS="-m single -d single" mkfs.btrfs -f $MKFS_OPTIONS $DEV mount $MOUNT_OPTIONS $DEV $MNT dbench -D $MNT -t 300 64 umount $MNT Before this change: Operation Count AvgLat MaxLat ---------------------------------------- NTCreateX 9326331 0.317 399.957 Close 6851198 0.002 6.402 Rename 394894 2.621 402.819 Unlink 1883131 0.931 398.082 Deltree 256 19.160 303.580 Mkdir 128 0.003 0.016 Qpathinfo 8452314 0.068 116.133 Qfileinfo 1481921 0.001 5.081 Qfsinfo 1549963 0.002 4.444 Sfileinfo 759679 0.084 17.079 Find |
|
|
|
c7c01a4a25 |
btrfs: tree-checker: annotate all error branches as unlikely
The tree checker is called many times as it verifies metadata at
read/write time. The checks follow a simple pattern:
if (error_condition) {
report_error();
return -EUCLEAN;
}
All the error reporting functions are annotated as __cold that is
supposed to hint the compiler to move the statement block out of the hot
path. This does not seem to happen that often.
As the error condition is expected to be false almost always, we can
annotate it with 'unlikely' as this satisfies one of the few use cases
for the annotation. The expected outcome is a stronger hint to compiler
to reorder the checks
test
jump to exit
test
jump to exit
...
which can be observed in asm of eg. check_dir_item,
btrfs_check_chunk_valid, check_root_item or check_leaf.
There's a measurable run time improvement reported by Josef, the testing
workload went from 655 MiB/s to 677 MiB/s, which is about +3%.
There should be no functional changes but some of the conditions have
been rewritten to produce more readable result, some lines are longer
than 80, for the sake of readability.
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
|
|
a0f6d924ca |
btrfs: remove stub device info from messages when we have no fs_info
Without a NULL fs_info the helpers will print something like BTRFS error (device <unknown>): ... This can happen in contexts where fs_info is not available at all or it's potentially unsafe due to object lifetime. The <unknown> stub does not bring much information and with the prefix makes the message unnecessarily longer. Remove it for the NULL fs_info case. BTRFS error: ... Callers can add the device information to the message itself if needed. Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com> |
|
|
|
fb22e9c4cd |
btrfs: use detach_page_private() in alloc_extent_buffer()
In alloc_extent_buffer(), after we got a page from btree inode, we check if that page has private pointer attached. If attached, we check if the existing extent buffer has proper refs. If not (the eb is being freed), we will detach that private eb pointer. The point here is, we are detaching that eb pointer by calling: - ClearPagePrivate() - put_page() The put_page() here is especially confusing, as it's decreasing the ref from attach_page_private(). Without knowing that, it looks like the put_page() is for the find_or_create_page() call, confusing the reader. Since we're always modifying page private with attach_page_private() and detach_page_private(), the only open-coded detach_page_private() here is really confusing. Fix it by calling detach_page_private(). Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
|
|
|
35478d053a |
btrfs: use nodesize to determine if we need readahead in btrfs_lookup_bio_sums
In btrfs_lookup_bio_sums() if the bio is pretty large, we want to start readahead in the csum tree. However the threshold is an immediate number, (PAGE_SIZE * 8), from the initial btrfs merge. The meaning of the value is pretty hard to guess, especially when the immediate number is from the times when 4K sectorsize was the default and only CRC32C was supported. For the most common btrfs setup, CRC32 csum and 4K sectorsize, it means just 32K read would kick readahead, while the csum itself is only 32 bytes in size. Now let's be more reasonable by taking both csum size and node size into consideration. If the csum size for the bio is larger than one leaf, then we kick the readahead. This means for current default btrfs, the threshold will be 16M. This change should not change performance observably, thus this is mostly a readability enhancement. Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
|
|
|
829ddec922 |
btrfs: only clear EXTENT_LOCK bit in extent_invalidatepage
extent_invalidatepage() will try to clear all possible bits since it's calling clear_extent_bit() with delete == 1. This is currently fine, since for btree io tree, it only utilizes EXTENT_LOCK bit. But this could be a problem for later subpage support, which will utilize extra io tree bit to represent additional info. This patch will just convert that clear_extent_bit() to unlock_extent_cached(). For current code since only EXTENT_LOCKED bit is utilized, this doesn't change the behavior, but provides a much cleaner basis for incoming subpage support. Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
|
|
|
8e1dc982ed |
btrfs: remove unused parameter phy_offset from btrfs_validate_metadata_buffer
Parameter @phy_offset is the offset against the bio->bi_iter.bi_sector. @phy_offset is mostly for data io to lookup the csum in btrfs_io_bio. But for metadata, it's completely useless as metadata stores their own csum in its header, so we can remove it. Note: parameters @start and @end, they are not utilized at all for current sectorsize == PAGE_SIZE case, as we can grab eb directly from page. But those two parameters are very important for later subpage support, thus @start/@len are not touched here. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
|
|
|
2c36395430 |
btrfs: scrub: remove the anonymous structure from scrub_page
That anonymous structure serve no special purpose, just replace it with regular members. Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
|
|
|
f97e27e91d |
btrfs: use fixed width int type for extent_state::state
Currently the type is unsigned int which could change its width depending on the architecture. We need up to 32 bits so make it explicit. Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
|
|
|
e09caaf913 |
btrfs: introduce helper to handle page status update in end_bio_extent_readpage()
Introduce a new helper to handle update page status in end_bio_extent_readpage(). This will be later used for subpage support where the page status update can be more complex than now. Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
|
|
|
94e8c95ccb |
btrfs: add structure to keep track of extent range in end_bio_extent_readpage
In end_bio_extent_readpage() we had a strange dance around
extent_start/extent_len.
Hidden behind the strange dance is, it's just calling
endio_readpage_release_extent() on each bvec range.
Here is an example to explain the original work flow:
Bio is for inode 257, containing 2 pages, for range [1M, 1M+8K)
end_bio_extent_extent_readpage() entered
|- extent_start = 0;
|- extent_end = 0;
|- bio_for_each_segment_all() {
| |- /* Got the 1st bvec */
| |- start = SZ_1M;
| |- end = SZ_1M + SZ_4K - 1;
| |- update = 1;
| |- if (extent_len == 0) {
| | |- extent_start = start; /* SZ_1M */
| | |- extent_len = end + 1 - start; /* SZ_1M */
| | }
| |
| |- /* Got the 2nd bvec */
| |- start = SZ_1M + 4K;
| |- end = SZ_1M + 4K - 1;
| |- update = 1;
| |- if (extent_start + extent_len == start) {
| | |- extent_len += end + 1 - start; /* SZ_8K */
| | }
| } /* All bio vec iterated */
|
|- if (extent_len) {
|- endio_readpage_release_extent(tree, extent_start, extent_len,
update);
/* extent_start == SZ_1M, extent_len == SZ_8K, uptodate = 1 */
As the above flow shows, the existing code in end_bio_extent_readpage()
is accumulates extent_start/extent_len, and when the contiguous range
stops, calls endio_readpage_release_extent() for the range.
However current behavior has something not really considered:
- The inode can change
For bio, its pages don't need to have contiguous page_offset.
This means, even pages from different inodes can be packed into one
bio.
- bvec cross page boundary
There is a feature called multi-page bvec, where bvec->bv_len can go
beyond bvec->bv_page boundary.
- Poor readability
This patch will address the problem:
- Introduce a proper structure, processed_extent, to record processed
extent range
- Integrate inode/start/end/uptodate check into
endio_readpage_release_extent()
- Add more comment on each step.
This should greatly improve the readability, now in
end_bio_extent_readpage() there are only two
endio_readpage_release_extent() calls.
- Add inode check for contiguity
Now we also ensure the inode is the same one before checking if the
range is contiguous.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
|
|
b1d51f67c9 |
btrfs: tests: remove invalid extent-io test
In extent-io-test, there are two invalid tests: - Invalid nodesize for test_eb_bitmaps() Instead of the sectorsize and nodesize combination passed in, we're always using hand-crafted nodesize, e.g: len = (sectorsize < BTRFS_MAX_METADATA_BLOCKSIZE) ? sectorsize * 4 : sectorsize; In above case, if we have 32K page size, then we will get a length of 128K, which is beyond max node size, and obviously invalid. The common page size goes up to 64K so we haven't hit that - Invalid extent buffer bytenr For 64K page size, the only combination we're going to test is sectorsize = nodesize = 64K. However, in that case we will try to test an eb which bytenr is not sectorsize aligned: /* Do it over again with an extent buffer which isn't page-aligned. */ eb = __alloc_dummy_extent_buffer(fs_info, nodesize / 2, len); Sector alignment is a hard requirement for any sector size. The only exception is superblock. But anything else should follow sector size alignment. This is definitely an invalid test case. This patch will fix both problems by: - Honor the sectorsize/nodesize combination Now we won't bother to hand-craft the length and use it as nodesize. - Use sectorsize as the 2nd run extent buffer start This would test the case where extent buffer is aligned to sectorsize but not always aligned to nodesize. Please note that, later subpage related cleanup will reduce extent_buffer::pages[] to exactly what we need, making the sector unaligned extent buffer operations cause problems. Since only extent_io self tests utilize this, this patch is required for all later cleanup/refactoring. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
|
|
|
445d8ab53f |
btrfs: sysfs: remove unneeded semicolon
A semicolon is not needed after a switch statement. Signed-off-by: Tom Rix <trix@redhat.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
|
|
|
95b982de37 |
btrfs: simplify return values in setup_nodes_for_search
The function is needlessly convoluted. Fix that by: * removing redundant sret variable definition in both if arms * replace the again/done labels with direct return statements, the function is short enough and doesn't do anything special upon exit * remove BUG_ON on split_node returning a positive number - it can't happen as split_node returns either 0 or a negative error code. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
|
|
|
d5286a92ea |
btrfs: remove useless return value statement in split_node
At the point when we set 'ret = 0' it's guaranteed that the function is going to return 0 so directly return 0. No functional changes. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
|
|
|
f30bed8342 |
btrfs: remove unnecessary attempt to drop extent maps after adding inline extent
At inode.c:cow_file_range_inline(), after we insert the inline extent
in the fs/subvolume btree, we call btrfs_drop_extent_cache() to drop
all extent maps in the file range, however that is not necessary because
we have already done it in the call to btrfs_drop_extents(), which calls
btrfs_drop_extent_cache() for us, and since at this point we have the file
range locked in the inode's iotree (we are in the writeback path), we know
no other task can come in and read stale file extent items or find none
and therefore create either stale extent maps or an extent map that
represents a hole.
So just remove that unnecessary call to btrfs_drop_extent_cache(), as it's
doing nothing and only wasting time. This call has been around since 2008,
introduced in commit
|
|
|
|
bc5b5b1e51 |
btrfs: stop incrementing log batch when joining log transaction
When joining a log transaction we acquire the root's log mutex, then increment the root's log batch and log writers counters while holding the mutex. However we don't need to increment the log batch there, because we are holding the mutex and incremented the log writers counter as well, so any other task trying to sync log will wait for the current task to finish its logging and still achieve the desired log batching. Since the log batch counter is an atomic counter and is incremented twice at the very beginning of the fsync callback (btrfs_sync_file()), once before flushing delalloc and once again after waiting for writeback to complete, eliminating its increment when joining the log transaction may provide some performance gains in case we have multiple concurrent tasks doing fsyncs against different files in the same subvolume, as it reduces contention on the atomic (locking the cacheline and bouncing it). When testing fio with 32 jobs, on a 8 cores VM, doing fsyncs against different files of the same subvolume, on top of a zram device, I could consistently see gains (higher throughput) between 1% to 2%, which is a very low value and possibly hard to be observed with a real device (I couldn't observe consistent gains with my low/mid end NVMe device). So this change is mostly motivated to just simplify the logic, as updating the log batch counter is only relevant when an fsync starts and while not holding the root's log mutex. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
|
|
|
f2f121ab50 |
btrfs: skip unnecessary searches for xattrs when logging an inode
Every time we log an inode we lookup in the fs/subvol tree for xattrs and if we have any, log them into the log tree. However it is very common to have inodes without any xattrs, so doing the search wastes times, but more importantly it adds contention on the fs/subvol tree locks, either making the logging code block and wait for tree locks or making the logging code making other concurrent operations block and wait. The most typical use cases where xattrs are used are when capabilities or ACLs are defined for an inode, or when SELinux is enabled. This change makes the logging code detect when an inode does not have xattrs and skip the xattrs search the next time the inode is logged, unless the inode is evicted and loaded again or a xattr is added to the inode. Therefore skipping the search for xattrs on inodes that don't ever have xattrs and are fsynced with some frequency. The following script that calls dbench was used to measure the impact of this change on a VM with 8 CPUs, 16Gb of ram, using a raw NVMe device directly (no intermediary filesystem on the host) and using a non-debug kernel (default configuration on Debian distributions): $ cat test.sh #!/bin/bash DEV=/dev/sdk MNT=/mnt/sdk MOUNT_OPTIONS="-o ssd" mkfs.btrfs -f -m single -d single $DEV mount $MOUNT_OPTIONS $DEV $MNT dbench -D $MNT -t 200 40 umount $MNT The results before this change: Operation Count AvgLat MaxLat ---------------------------------------- NTCreateX 5761605 0.172 312.057 Close 4232452 0.002 10.927 Rename 243937 1.406 277.344 Unlink 1163456 0.631 298.402 Deltree 160 11.581 221.107 Mkdir 80 0.003 0.005 Qpathinfo 5221410 0.065 122.309 Qfileinfo 915432 0.001 3.333 Qfsinfo 957555 0.003 3.992 Sfileinfo 469244 0.023 20.494 Find 2018865 0.448 123.659 WriteX 2874851 0.049 118.529 ReadX 9030579 0.004 21.654 LockX 18754 0.003 4.423 UnlockX 18754 0.002 0.331 Flush 403792 10.944 359.494 Throughput 908.444 MB/sec 40 clients 40 procs max_latency=359.500 ms The results after this change: Operation Count AvgLat MaxLat ---------------------------------------- NTCreateX 6442521 0.159 230.693 Close 4732357 0.002 10.972 Rename 272809 1.293 227.398 Unlink 1301059 0.563 218.500 Deltree 160 7.796 54.887 Mkdir 80 0.008 0.478 Qpathinfo 5839452 0.047 124.330 Qfileinfo 1023199 0.001 4.996 Qfsinfo 1070760 0.003 5.709 Sfileinfo 524790 0.033 21.765 Find 2257658 0.314 125.611 WriteX 3211520 0.040 232.135 ReadX 10098969 0.004 25.340 LockX 20974 0.003 1.569 UnlockX 20974 0.002 3.475 Flush 451553 10.287 331.037 Throughput 1011.77 MB/sec 40 clients 40 procs max_latency=331.045 ms +10.8% throughput, -8.2% max latency Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
|
|
|
1cab5e7283 |
btrfs: merge __set_extent_bit and set_extent_bit
There are only 2 direct calls to set_extent_bit outside of extent-io - in btrfs_find_new_delalloc_bytes and btrfs_truncate_block, the rest are thin wrappers around __set_extent_bit. This adds unnecessary indirection and just makes it more annoying when looking at the various extent bit manipulation functions. This patch renames __set_extent_bit to set_extent_bit effectively removing a level of indirection. No functional changes. Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> [ reformat and remove __must_check ] Signed-off-by: David Sterba <dsterba@suse.com> |
|
|
|
729f796172 |
btrfs: make btrfs_update_inode_fallback take btrfs_inode
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
|
|
|
b06359a325 |
btrfs: make btrfs_cont_expand take btrfs_inode
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
|
|
|
217f42eb3d |
btrfs: make btrfs_truncate_block take btrfs_inode
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
|
|
|
03fcb1ab6f |
btrfs: make btrfs_insert_replace_extent take btrfs_inode
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
|
|
|
dea46d84a3 |
btrfs: make find_first_non_hole take btrfs_inode
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
|
|
|
a4ba6cc03e |
btrfs: make maybe_insert_hole take btrfs_inode
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
|
|
|
9a56fcd15a |
btrfs: make btrfs_update_inode take btrfs_inode
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
|
|
|
dfeb9e7cc3 |
btrfs: make btrfs_update_inode_item take btrfs_inode
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
|
|
|
f3fbcaef59 |
btrfs: make btrfs_delayed_update_inode take btrfs_inode
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
|
|
|
72e7e6edd3 |
btrfs: make btrfs_finish_ordered_io btrfs_inode-centric
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
|
|
|
507433985c |
btrfs: make btrfs_truncate_inode_items take btrfs_inode
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
|
|
|
90dffd0cff |
btrfs: make insert_prealloc_file_extent take btrfs_inode
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
|
|
|
76aea53796 |
btrfs: make btrfs_inode_safe_disk_i_size_write take btrfs_inode
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
|
|
|
a55463c9f0 |
btrfs: remove extent_buffer::recursed
It is unused everywhere now, it can be removed. Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
|
|
|
0ecae6fffe |
btrfs: remove the recurse parameter from __btrfs_tree_read_lock
It is completely unused now, remove it. Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
|
|
|
fe596ca3d3 |
btrfs: use btrfs_tree_read_lock in btrfs_search_slot
We no longer use recursion, so __btrfs_tree_read_lock(BTRFS_NESTING_NORMAL) == btrfs_tree_read_lock. Replace this call with the simple helper. Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
|
|
|
1bb9659841 |
btrfs: merge back btrfs_read_lock_root_node helpers
We no longer have recursive locking and there's no need for separate helpers that allowed the transition to rwsem with minimal code changes. Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
|
|
|
4048daedb9 |
btrfs: locking: remove the recursion handling code
Now that we're no longer using recursion, rip out all of the supporting code. Follow up patches will clean up the callers of these functions. The extent_buffer::lock_owner is still retained as it allows safety checks in btrfs_init_new_buffer for the case that the free space cache is corrupted and we try to allocate a block that we are currently using and have locked in the path. Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
|
|
|
2f5239dcb2 |
btrfs: remove btrfs_path::recurse
With my async free space cache loading patches ("btrfs: load free space
cache asynchronously") we no longer have a user of path->recurse and can
remove it.
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
|
|
0e46318df8 |
btrfs: unlock to current level in btrfs_next_old_leaf
Filipe reported the following lockdep splat ====================================================== WARNING: possible circular locking dependency detected 5.10.0-rc2-btrfs-next-71 #1 Not tainted ------------------------------------------------------ find/324157 is trying to acquire lock: ffff8ebc48d293a0 (btrfs-tree-01#2/3){++++}-{3:3}, at: __btrfs_tree_read_lock+0x32/0x1a0 [btrfs] but task is already holding lock: ffff8eb9932c5088 (btrfs-tree-00){++++}-{3:3}, at: __btrfs_tree_read_lock+0x32/0x1a0 [btrfs] which lock already depends on the new lock. the existing dependency chain (in reverse order) is: -> #1 (btrfs-tree-00){++++}-{3:3}: lock_acquire+0xd8/0x490 down_write_nested+0x44/0x120 __btrfs_tree_lock+0x27/0x120 [btrfs] btrfs_search_slot+0x2a3/0xc50 [btrfs] btrfs_insert_empty_items+0x58/0xa0 [btrfs] insert_with_overflow+0x44/0x110 [btrfs] btrfs_insert_xattr_item+0xb8/0x1d0 [btrfs] btrfs_setxattr+0xd6/0x4c0 [btrfs] btrfs_setxattr_trans+0x68/0x100 [btrfs] __vfs_setxattr+0x66/0x80 __vfs_setxattr_noperm+0x70/0x200 vfs_setxattr+0x6b/0x120 setxattr+0x125/0x240 path_setxattr+0xba/0xd0 __x64_sys_setxattr+0x27/0x30 do_syscall_64+0x33/0x80 entry_SYSCALL_64_after_hwframe+0x44/0xa9 -> #0 (btrfs-tree-01#2/3){++++}-{3:3}: check_prev_add+0x91/0xc60 __lock_acquire+0x1689/0x3130 lock_acquire+0xd8/0x490 down_read_nested+0x45/0x220 __btrfs_tree_read_lock+0x32/0x1a0 [btrfs] btrfs_next_old_leaf+0x27d/0x580 [btrfs] btrfs_real_readdir+0x1e3/0x4b0 [btrfs] iterate_dir+0x170/0x1c0 __x64_sys_getdents64+0x83/0x140 do_syscall_64+0x33/0x80 entry_SYSCALL_64_after_hwframe+0x44/0xa9 other info that might help us debug this: Possible unsafe locking scenario: CPU0 CPU1 ---- ---- lock(btrfs-tree-00); lock(btrfs-tree-01#2/3); lock(btrfs-tree-00); lock(btrfs-tree-01#2/3); *** DEADLOCK *** 5 locks held by find/324157: #0: ffff8ebc502c6e00 (&f->f_pos_lock){+.+.}-{3:3}, at: __fdget_pos+0x4d/0x60 #1: ffff8eb97f689980 (&type->i_mutex_dir_key#10){++++}-{3:3}, at: iterate_dir+0x52/0x1c0 #2: ffff8ebaec00ca58 (btrfs-tree-02#2){++++}-{3:3}, at: __btrfs_tree_read_lock+0x32/0x1a0 [btrfs] #3: ffff8eb98f986f78 (btrfs-tree-01#2){++++}-{3:3}, at: __btrfs_tree_read_lock+0x32/0x1a0 [btrfs] #4: ffff8eb9932c5088 (btrfs-tree-00){++++}-{3:3}, at: __btrfs_tree_read_lock+0x32/0x1a0 [btrfs] stack backtrace: CPU: 2 PID: 324157 Comm: find Not tainted 5.10.0-rc2-btrfs-next-71 #1 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.13.0-0-gf21b5a4aeb02-prebuilt.qemu.org 04/01/2014 Call Trace: dump_stack+0x8d/0xb5 check_noncircular+0xff/0x110 ? mark_lock.part.0+0x468/0xe90 check_prev_add+0x91/0xc60 __lock_acquire+0x1689/0x3130 ? kvm_clock_read+0x14/0x30 ? kvm_sched_clock_read+0x5/0x10 lock_acquire+0xd8/0x490 ? __btrfs_tree_read_lock+0x32/0x1a0 [btrfs] down_read_nested+0x45/0x220 ? __btrfs_tree_read_lock+0x32/0x1a0 [btrfs] __btrfs_tree_read_lock+0x32/0x1a0 [btrfs] btrfs_next_old_leaf+0x27d/0x580 [btrfs] btrfs_real_readdir+0x1e3/0x4b0 [btrfs] iterate_dir+0x170/0x1c0 __x64_sys_getdents64+0x83/0x140 ? filldir+0x1d0/0x1d0 do_syscall_64+0x33/0x80 entry_SYSCALL_64_after_hwframe+0x44/0xa9 This happens because btrfs_next_old_leaf searches down to our current key, and then walks up the path until we can move to the next slot, and then reads back down the path so we get the next leaf. However it doesn't unlock any lower levels until it replaces them with the new extent buffer. This is technically fine, but of course causes lockdep to complain, because we could be holding locks on lower levels while locking upper levels. Fix this by dropping all nodes below the level that we use as our new starting point before we start reading back down the path. This also allows us to drop the nested/recursive locking magic, because we're no longer locking two nodes at the same level anymore. Reported-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
|
|
|
ffeb03cfe2 |
btrfs: cleanup the locking in btrfs_next_old_leaf
We are carrying around this next_rw_lock from when we would do spinning vs blocking read locks. Now that we have the rwsem locking we can simply use the read lock flag unconditionally and the read lock helpers. Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
|
|
|
b2598edf8b |
btrfs: remove unused argument seed from btrfs_find_device
Commit 343694eee8d8 ("btrfs: switch seed device to list api"), missed to
check if the parameter seed is true in the function btrfs_find_device().
This tells it whether to traverse the seed device list or not.
After this commit, the argument is unused and can be removed.
In device_list_add() it's not necessary because fs_devices always points
to the device's fs_devices. So with the devid+uuid matching, it will
find the right device and return, thus not needing to traverse seed
devices.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
|
|
3a160a9331 |
btrfs: drop never met disk total bytes check in verify_one_dev_extent
Drop the condition in verify_one_dev_extent,
btrfs_device::disk_total_bytes is set even for a seed device. The
comment is wrong, the size is properly set when cloning the device.
Commit
|
|
|
|
bacce86ae8 |
btrfs: drop unused argument step from btrfs_free_extra_devids
Commit
|
|
|
|
2766ff6176 |
btrfs: update the number of bytes used by an inode atomically
There are several occasions where we do not update the inode's number of used bytes atomically, resulting in a concurrent stat(2) syscall to report a value of used blocks that does not correspond to a valid value, that is, a value that does not match neither what we had before the operation nor what we get after the operation completes. In extreme cases it can result in stat(2) reporting zero used blocks, which can cause problems for some userspace tools where they can consider a file with a non-zero size and zero used blocks as completely sparse and skip reading data, as reported/discussed a long time ago in some threads like the following: https://lists.gnu.org/archive/html/bug-tar/2016-07/msg00001.html The cases where this can happen are the following: -> Case 1 If we do a write (buffered or direct IO) against a file region for which there is already an allocated extent (or multiple extents), then we have a short time window where we can report a number of used blocks to stat(2) that does not take into account the file region being overwritten. This short time window happens when completing the ordered extent(s). This happens because when we drop the extents in the write range we decrement the inode's number of bytes and later on when we insert the new extent(s) we increment the number of bytes in the inode, resulting in a short time window where a stat(2) syscall can get an incorrect number of used blocks. If we do writes that overwrite an entire file, then we have a short time window where we report 0 used blocks to stat(2). Example reproducer: $ cat reproducer-1.sh #!/bin/bash MNT=/mnt/sdi DEV=/dev/sdi stat_loop() { trap "wait; exit" SIGTERM local filepath=$1 local expected=$2 local got while :; do got=$(stat -c %b $filepath) if [ $got -ne $expected ]; then echo -n "ERROR: unexpected used blocks" echo " (got: $got expected: $expected)" fi done } mkfs.btrfs -f $DEV > /dev/null # mkfs.xfs -f $DEV > /dev/null # mkfs.ext4 -F $DEV > /dev/null # mkfs.f2fs -f $DEV > /dev/null # mkfs.reiserfs -f $DEV > /dev/null mount $DEV $MNT xfs_io -f -s -c "pwrite -b 64K 0 64K" $MNT/foobar >/dev/null expected=$(stat -c %b $MNT/foobar) # Create a process to keep calling stat(2) on the file and see if the # reported number of blocks used (disk space used) changes, it should # not because we are not increasing the file size nor punching holes. stat_loop $MNT/foobar $expected & loop_pid=$! for ((i = 0; i < 50000; i++)); do xfs_io -s -c "pwrite -b 64K 0 64K" $MNT/foobar >/dev/null done kill $loop_pid &> /dev/null wait umount $DEV $ ./reproducer-1.sh ERROR: unexpected used blocks (got: 0 expected: 128) ERROR: unexpected used blocks (got: 0 expected: 128) (...) Note that since this is a short time window where the race can happen, the reproducer may not be able to always trigger the bug in one run, or it may trigger it multiple times. -> Case 2 If we do a buffered write against a file region that does not have any allocated extents, like a hole or beyond EOF, then during ordered extent completion we have a short time window where a concurrent stat(2) syscall can report a number of used blocks that does not correspond to the value before or after the write operation, a value that is actually larger than the value after the write completes. This happens because once we start a buffered write into an unallocated file range we increment the inode's 'new_delalloc_bytes', to make sure any stat(2) call gets a correct used blocks value before delalloc is flushed and completes. However at ordered extent completion, after we inserted the new extent, we increment the inode's number of bytes used with the size of the new extent, and only later, when clearing the range in the inode's iotree, we decrement the inode's 'new_delalloc_bytes' counter with the size of the extent. So this results in a short time window where a concurrent stat(2) syscall can report a number of used blocks that accounts for the new extent twice. Example reproducer: $ cat reproducer-2.sh #!/bin/bash MNT=/mnt/sdi DEV=/dev/sdi stat_loop() { trap "wait; exit" SIGTERM local filepath=$1 local expected=$2 local got while :; do got=$(stat -c %b $filepath) if [ $got -ne $expected ]; then echo -n "ERROR: unexpected used blocks" echo " (got: $got expected: $expected)" fi done } mkfs.btrfs -f $DEV > /dev/null # mkfs.xfs -f $DEV > /dev/null # mkfs.ext4 -F $DEV > /dev/null # mkfs.f2fs -f $DEV > /dev/null # mkfs.reiserfs -f $DEV > /dev/null mount $DEV $MNT touch $MNT/foobar write_size=$((64 * 1024)) for ((i = 0; i < 16384; i++)); do offset=$(($i * $write_size)) xfs_io -c "pwrite -S 0xab $offset $write_size" $MNT/foobar >/dev/null blocks_used=$(stat -c %b $MNT/foobar) # Fsync the file to trigger writeback and keep calling stat(2) on it # to see if the number of blocks used changes. stat_loop $MNT/foobar $blocks_used & loop_pid=$! xfs_io -c "fsync" $MNT/foobar kill $loop_pid &> /dev/null wait $loop_pid done umount $DEV $ ./reproducer-2.sh ERROR: unexpected used blocks (got: 265472 expected: 265344) ERROR: unexpected used blocks (got: 284032 expected: 283904) (...) Note that since this is a short time window where the race can happen, the reproducer may not be able to always trigger the bug in one run, or it may trigger it multiple times. -> Case 3 Another case where such problems happen is during other operations that replace extents in a file range with other extents. Those operations are extent cloning, deduplication and fallocate's zero range operation. The cause of the problem is similar to the first case. When we drop the extents from a range, we decrement the inode's number of bytes, and later on, after inserting the new extents we increment it. Since this is not done atomically, a concurrent stat(2) call can see and return a number of used blocks that is smaller than it should be, does not match the number of used blocks before or after the clone/deduplication/zero operation. Like for the first case, when doing a clone, deduplication or zero range operation against an entire file, we end up having a time window where we can report 0 used blocks to a stat(2) call. Example reproducer: $ cat reproducer-3.sh #!/bin/bash MNT=/mnt/sdi DEV=/dev/sdi mkfs.btrfs -f $DEV > /dev/null # mkfs.xfs -f -m reflink=1 $DEV > /dev/null mount $DEV $MNT extent_size=$((64 * 1024)) num_extents=16384 file_size=$(($extent_size * $num_extents)) # File foo has many small extents. xfs_io -f -s -c "pwrite -S 0xab -b $extent_size 0 $file_size" $MNT/foo \ > /dev/null # File bar has much less extents and has exactly the same data as foo. xfs_io -f -c "pwrite -S 0xab 0 $file_size" $MNT/bar > /dev/null expected=$(stat -c %b $MNT/foo) # Now deduplicate bar into foo. While the deduplication is in progres, # the number of used blocks/file size reported by stat should not change xfs_io -c "dedupe $MNT/bar 0 0 $file_size" $MNT/foo > /dev/null & dedupe_pid=$! while [ -n "$(ps -p $dedupe_pid -o pid=)" ]; do used=$(stat -c %b $MNT/foo) if [ $used -ne $expected ]; then echo "Unexpected blocks used: $used (expected: $expected)" fi done umount $DEV $ ./reproducer-3.sh Unexpected blocks used: 2076800 (expected: 2097152) Unexpected blocks used: 2097024 (expected: 2097152) Unexpected blocks used: 2079872 (expected: 2097152) (...) Note that since this is a short time window where the race can happen, the reproducer may not be able to always trigger the bug in one run, or it may trigger it multiple times. So fix this by: 1) Making btrfs_drop_extents() not decrement the VFS inode's number of bytes, and instead return the number of bytes; 2) Making any code that drops extents and adds new extents update the inode's number of bytes atomically, while holding the btrfs inode's spinlock, which is also used by the stat(2) callback to get the inode's number of bytes; 3) For ranges in the inode's iotree that are marked as 'delalloc new', corresponding to previously unallocated ranges, increment the inode's number of bytes when clearing the 'delalloc new' bit from the range, in the same critical section that decrements the inode's 'new_delalloc_bytes' counter, delimited by the btrfs inode's spinlock. An alternative would be to have btrfs_getattr() wait for any IO (ordered extents in progress) and locking the whole range (0 to (u64)-1) while it it computes the number of blocks used. But that would mean blocking stat(2), which is a very used syscall and expected to be fast, waiting for writes, clone/dedupe, fallocate, page reads, fiemap, etc. CC: stable@vger.kernel.org # 5.4+ Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
|
|
|
7f458a3873 |
btrfs: fix race when defragmenting leads to unnecessary IO
When defragmenting we skip ranges that have holes or inline extents, so that we don't do unnecessary IO and waste space. We do this check when calling should_defrag_range() at btrfs_defrag_file(). However we do it without holding the inode's lock. The reason we do it like this is to avoid blocking other tasks for too long, that possibly want to operate on other file ranges, since after the call to should_defrag_range() and before locking the inode, we trigger a synchronous page cache readahead. However before we were able to lock the inode, some other task might have punched a hole in our range, or we may now have an inline extent there, in which case we should not set the range for defrag anymore since that would cause unnecessary IO and make us waste space (i.e. allocating extents to contain zeros for a hole). So after we locked the inode and the range in the iotree, check again if we have holes or an inline extent, and if we do, just skip the range. I hit this while testing my next patch that fixes races when updating an inode's number of bytes (subject "btrfs: update the number of bytes used by an inode atomically"), and it depends on this change in order to work correctly. Alternatively I could rework that other patch to detect holes and flag their range with the 'new delalloc' bit, but this itself fixes an efficiency problem due a race that from a functional point of view is not harmful (it could be triggered with btrfs/062 from fstests). CC: stable@vger.kernel.org # 5.4+ Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
|
|
|
5893dfb98f |
btrfs: refactor btrfs_drop_extents() to make it easier to extend
There are many arguments for __btrfs_drop_extents() and its wrapper
btrfs_drop_extents(), which makes it hard to add more arguments to it and
requires changing every caller. I have added a couple myself back in 2014
commit
|
|
|
|
e114c545bb |
btrfs: set the lockdep class for extent buffers on creation
Both Filipe and Fedora QA recently hit the following lockdep splat: WARNING: possible recursive locking detected 5.10.0-0.rc1.20201028gited8780e3f2ec.57.fc34.x86_64 #1 Not tainted -------------------------------------------- rsync/2610 is trying to acquire lock: ffff89617ed48f20 (&eb->lock){++++}-{2:2}, at: btrfs_tree_read_lock_atomic+0x34/0x140 but task is already holding lock: ffff8961757b1130 (&eb->lock){++++}-{2:2}, at: btrfs_tree_read_lock_atomic+0x34/0x140 other info that might help us debug this: Possible unsafe locking scenario: CPU0 ---- lock(&eb->lock); lock(&eb->lock); *** DEADLOCK *** May be due to missing lock nesting notation 2 locks held by rsync/2610: #0: ffff896107212b90 (&type->i_mutex_dir_key#10){++++}-{3:3}, at: walk_component+0x10c/0x190 #1: ffff8961757b1130 (&eb->lock){++++}-{2:2}, at: btrfs_tree_read_lock_atomic+0x34/0x140 stack backtrace: CPU: 1 PID: 2610 Comm: rsync Not tainted 5.10.0-0.rc1.20201028gited8780e3f2ec.57.fc34.x86_64 #1 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 0.0.0 02/06/2015 Call Trace: dump_stack+0x8b/0xb0 __lock_acquire.cold+0x12d/0x2a4 ? kvm_sched_clock_read+0x14/0x30 ? sched_clock+0x5/0x10 lock_acquire+0xc8/0x400 ? btrfs_tree_read_lock_atomic+0x34/0x140 ? read_block_for_search.isra.0+0xdd/0x320 _raw_read_lock+0x3d/0xa0 ? btrfs_tree_read_lock_atomic+0x34/0x140 btrfs_tree_read_lock_atomic+0x34/0x140 btrfs_search_slot+0x616/0x9a0 btrfs_lookup_dir_item+0x6c/0xb0 btrfs_lookup_dentry+0xa8/0x520 ? lockdep_init_map_waits+0x4c/0x210 btrfs_lookup+0xe/0x30 __lookup_slow+0x10f/0x1e0 walk_component+0x11b/0x190 path_lookupat+0x72/0x1c0 filename_lookup+0x97/0x180 ? strncpy_from_user+0x96/0x1e0 ? getname_flags.part.0+0x45/0x1a0 vfs_statx+0x64/0x100 ? lockdep_hardirqs_on_prepare+0xff/0x180 ? _raw_spin_unlock_irqrestore+0x41/0x50 __do_sys_newlstat+0x26/0x40 ? lockdep_hardirqs_on_prepare+0xff/0x180 ? syscall_enter_from_user_mode+0x27/0x80 ? syscall_enter_from_user_mode+0x27/0x80 do_syscall_64+0x33/0x40 entry_SYSCALL_64_after_hwframe+0x44/0xa9 I have also seen a report of lockdep complaining about the lock class that was looked up being the same as the lock class on the lock we were using, but I can't find the report. These are problems that occur because we do not have the lockdep class set on the extent buffer until _after_ we read the eb in properly. This is problematic for concurrent readers, because we will create the extent buffer, lock it, and then attempt to read the extent buffer. If a second thread comes in and tries to do a search down the same path they'll get the above lockdep splat because the class isn't set properly on the extent buffer. There was a good reason for this, we generally didn't know the real owner of the eb until we read it, specifically in refcounted roots. However now all refcounted roots have the same class name, so we no longer need to worry about this. For non-refcounted trees we know which root we're on based on the parent. Fix this by setting the lockdep class on the eb at creation time instead of read time. This will fix the splat and the weirdness where the class changes in the middle of locking the block. Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
|
|
|
3fbaf25817 |
btrfs: pass the owner_root and level to alloc_extent_buffer
Now that we've plumbed all of the callers to have the owner root and the level, plumb it down into alloc_extent_buffer(). Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
|
|
|
5d81230baa |
btrfs: pass the root owner and level around for readahead
The readahead infrastructure does raw reads of extent buffers, but we're going to need to know their owner and level in order to set the lockdep key properly, so plumb in the infrastructure that we'll need to have this information when we start allocating extent buffers. Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
|
|
|
1b7ec85ef4 |
btrfs: pass root owner to read_tree_block
In order to properly set the lockdep class of a newly allocated block we need to know the owner of the block. For non-refcounted trees this is straightforward, we always know in advance what tree we're reading from. For refcounted trees we don't necessarily know, however all refcounted trees share the same lockdep class name, tree-<level>. Fix all the callers of read_tree_block() to pass in the root objectid we're using. In places like relocation and backref we could probably unconditionally use 0, but just in case use the root when we have it, otherwise use 0 in the cases we don't have the root as it's going to be a refcounted tree anyway. This is a preparation patch for further changes. Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
|
|
|
182c79fcb8 |
btrfs: use btrfs_read_node_slot in btrfs_qgroup_trace_subtree
We're open-coding btrfs_read_node_slot() here, replace with the helper. Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
|
|
|
3acfbd6a99 |
btrfs: use btrfs_read_node_slot in qgroup_trace_new_subtree_blocks
We're open-coding btrfs_read_node_slot() here, replace with the helper. Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
|
|
|
6b2cb7cb95 |
btrfs: use btrfs_read_node_slot in qgroup_trace_extent_swap
We're open-coding btrfs_read_node_slot() here, replace with the helper. Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
|
|
|
c990ada2a0 |
btrfs: use btrfs_read_node_slot in walk_down_tree
We're open-coding btrfs_read_node_slot() here, replace with the helper. Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
|
|
|
6b3426be27 |
btrfs: use btrfs_read_node_slot in replace_path
We're open-coding btrfs_read_node_slot() here, replace with the helper. Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
|
|
|
c975253682 |
btrfs: use btrfs_read_node_slot in do_relocation
We're open coding btrfs_read_node_slot in do_relocation, replace this with the proper helper. Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
|
|
|
8ef385bbf0 |
btrfs: use btrfs_read_node_slot in walk_down_reloc_tree
We do not need to call read_tree_block() here, simply use the btrfs_read_node_slot helper. Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
|
|
|
206983b72a |
btrfs: use btrfs_read_node_slot in btrfs_realloc_node
We have this open-coded nightmare in btrfs_realloc_node that does the same thing that the normal read path does, which is to see if we have the eb in memory already, and if not read it, and verify the eb is uptodate. Delete this open coding and simply use btrfs_read_node_slot. Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
|
|
|
bfb484d922 |
btrfs: cleanup extent buffer readahead
We're going to pass around more information when we allocate extent buffers, in order to make that cleaner how we do readahead. Most of the callers have the parent node that we're getting our blockptr from, with the sole exception of relocation which simply has the bytenr it wants to read. Add a helper that takes the current arguments that we need (bytenr and gen), and add another helper for simply reading the slot out of a node. In followup patches the helper that takes all the extra arguments will be expanded, and the simpler helper won't need to have it's arguments adjusted. Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
|
|
|
416e3445ef |
btrfs: remove lockdep classes for the fs tree
We have this weird problem where our lockdep class is set after we read a tree block, which can race with concurrent readers and result in erroneous lockdep errors. We want to set the lockdep class at allocation time if possible, but in certain cases we may not have the actual root owner, such as with relocation or any backref lookups. This is only really a problem for reference counted trees, because all other trees have their root reference set in their extent reference. Remove the fs tree specific lock class. We need to still keep the reloc tree one, it's still reference counted, because replace_path will lock the reloc tree and the destination tree, and if they're both set to tree-<level> we'll have issues. Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
|
|
|
3e48d8d254 |
btrfs: discard: reschedule work after sysfs param update
After sysfs updates discard's iops_limit or kbps_limit it also needs to adjust current timer through rescheduling, otherwise the discard work may wait for a long time for the previous timer to expire or bumped by someone else. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
|
|
|
df903e5d29 |
btrfs: don't miss async discards after scheduled work override
If btrfs_discard_schedule_work() is called with override=true, it sets delay anew regardless how much time is left until the timer should have fired. If delays are long (that can happen, for example, with low kbps_limit), they might get constantly overridden without having a chance to run the discard work. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
|
|
|
6e88f116bd |
btrfs: discard: store async discard delay as ns not as jiffies
Most delay calculations are done in ns or ms, so store discard_ctl->delay in ms and convert the final delay to jiffies only at the end. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
|
|
|
e50404a8a6 |
btrfs: discard: speed up async discard up to iops_limit
Instead of using iops_limit only for cutting off extremes, calculate the discard delay directly from it, so it closely follows iops_limit and doesn't under-discard even though quotas are not saturated. The iops limit could be hit more often in some cases and could increase the discard rate. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
|
|
|
480a8ec83b |
btrfs: scrub: refactor scrub_find_csum()
Function scrub_find_csum() is to locate the csum for bytenr @logical from sctx->csum_list. However it lacks a lot of comments to explain things like how the csum_list is organized and why we need to drop csum range which is before us. Refactor the function by: - Add more comments explaining the behavior - Add comment explaining why we need to drop the csum range - Put the csum copy in the main loop This is mostly for the incoming patches to make scrub_find_csum() able to find multiple checksums. Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
|
|
|
96e63a45fb |
btrfs: scrub: remove the force parameter from scrub_pages
The @force parameter for scrub_pages() is to indicate whether we want to force bio submission. Currently it's only used for the super block, and it can be easily determined by the @flags, so we can remove the parameter. Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
|
|
|
261d2dcb24 |
btrfs: scrub: distinguish scrub page from regular page
There are several call sites where we declare something like "struct scrub_page *page". This is confusing as we also use regular page in this code, rename it to 'spage' where applicable. Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
|
|
|
ac303b6987 |
btrfs: pass bvec to csum_dirty_buffer instead of page
Currently csum_dirty_buffer() uses page to grab extent buffer, but that only works for sector size == PAGE_SIZE case. For subpage we need page + page_offset to grab extent buffer. Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
|
|
|
77bf40a2ba |
btrfs: extract extent buffer verification from btrfs_validate_metadata_buffer()
Currently btrfs_validate_metadata_buffer() only needs to handle one extent buffer as currently one page maps to at most one extent buffer. For incoming subpage support, we need to extend the support where one page could contain multiple extent buffers. Split the function so we can call validate_extent_buffer on extent buffers independently. Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
|
|
|
a26663e7a2 |
btrfs: make csum_tree_block() handle node smaller than page
For subpage size support, metadata blocks of nodesize are smaller than one page and this needs to be handled when calculating the checksum. The checksummed start and length need to be adjusted but only for the first page: - start is simply offset in the page - length is nodesize (subpage) or PAGE_SIZE for all other cases Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
|
|
|
2f4d60dfae |
btrfs: grab fs_info from extent_buffer in btrfs_mark_buffer_dirty
Since commit
|
|
|
|
478ef8868f |
btrfs: make buffer_radix take sector size units
For subpage sector size support, one page can contain multiple tree blocks. The entries cannot be based on page size and index must be derived from the sectorsize. No change for page size == sector size. Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
|
|
|
0d01e247a0 |
btrfs: assert page mapping lock in attach_extent_buffer_page
When calling attach_extent_buffer_page(), either we're attaching anonymous pages, called from btrfs_clone_extent_buffer(), or we're attaching btree inode pages, called from alloc_extent_buffer(). For the latter case, we should hold page->mapping->private_lock to avoid parallel changes to page->private. Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
|
|
|
bbb86a3717 |
btrfs: protect fs_info->caching_block_groups by block_group_cache_lock
I got the following lockdep splat ====================================================== WARNING: possible circular locking dependency detected 5.9.0+ #101 Not tainted ------------------------------------------------------ btrfs-cleaner/3445 is trying to acquire lock: ffff89dbec39ab48 (btrfs-root-00){++++}-{3:3}, at: __btrfs_tree_read_lock+0x32/0x170 but task is already holding lock: ffff89dbeaf28a88 (&fs_info->commit_root_sem){++++}-{3:3}, at: btrfs_find_all_roots+0x41/0x80 which lock already depends on the new lock. the existing dependency chain (in reverse order) is: -> #2 (&fs_info->commit_root_sem){++++}-{3:3}: down_write+0x3d/0x70 btrfs_cache_block_group+0x2d5/0x510 find_free_extent+0xb6e/0x12f0 btrfs_reserve_extent+0xb3/0x1b0 btrfs_alloc_tree_block+0xb1/0x330 alloc_tree_block_no_bg_flush+0x4f/0x60 __btrfs_cow_block+0x11d/0x580 btrfs_cow_block+0x10c/0x220 commit_cowonly_roots+0x47/0x2e0 btrfs_commit_transaction+0x595/0xbd0 sync_filesystem+0x74/0x90 generic_shutdown_super+0x22/0x100 kill_anon_super+0x14/0x30 btrfs_kill_super+0x12/0x20 deactivate_locked_super+0x36/0xa0 cleanup_mnt+0x12d/0x190 task_work_run+0x5c/0xa0 exit_to_user_mode_prepare+0x1df/0x200 syscall_exit_to_user_mode+0x54/0x280 entry_SYSCALL_64_after_hwframe+0x44/0xa9 -> #1 (&space_info->groups_sem){++++}-{3:3}: down_read+0x40/0x130 find_free_extent+0x2ed/0x12f0 btrfs_reserve_extent+0xb3/0x1b0 btrfs_alloc_tree_block+0xb1/0x330 alloc_tree_block_no_bg_flush+0x4f/0x60 __btrfs_cow_block+0x11d/0x580 btrfs_cow_block+0x10c/0x220 commit_cowonly_roots+0x47/0x2e0 btrfs_commit_transaction+0x595/0xbd0 sync_filesystem+0x74/0x90 generic_shutdown_super+0x22/0x100 kill_anon_super+0x14/0x30 btrfs_kill_super+0x12/0x20 deactivate_locked_super+0x36/0xa0 cleanup_mnt+0x12d/0x190 task_work_run+0x5c/0xa0 exit_to_user_mode_prepare+0x1df/0x200 syscall_exit_to_user_mode+0x54/0x280 entry_SYSCALL_64_after_hwframe+0x44/0xa9 -> #0 (btrfs-root-00){++++}-{3:3}: __lock_acquire+0x1167/0x2150 lock_acquire+0xb9/0x3d0 down_read_nested+0x43/0x130 __btrfs_tree_read_lock+0x32/0x170 __btrfs_read_lock_root_node+0x3a/0x50 btrfs_search_slot+0x614/0x9d0 btrfs_find_root+0x35/0x1b0 btrfs_read_tree_root+0x61/0x120 btrfs_get_root_ref+0x14b/0x600 find_parent_nodes+0x3e6/0x1b30 btrfs_find_all_roots_safe+0xb4/0x130 btrfs_find_all_roots+0x60/0x80 btrfs_qgroup_trace_extent_post+0x27/0x40 btrfs_add_delayed_data_ref+0x3fd/0x460 btrfs_free_extent+0x42/0x100 __btrfs_mod_ref+0x1d7/0x2f0 walk_up_proc+0x11c/0x400 walk_up_tree+0xf0/0x180 btrfs_drop_snapshot+0x1c7/0x780 btrfs_clean_one_deleted_snapshot+0xfb/0x110 cleaner_kthread+0xd4/0x140 kthread+0x13a/0x150 ret_from_fork+0x1f/0x30 other info that might help us debug this: Chain exists of: btrfs-root-00 --> &space_info->groups_sem --> &fs_info->commit_root_sem Possible unsafe locking scenario: CPU0 CPU1 ---- ---- lock(&fs_info->commit_root_sem); lock(&space_info->groups_sem); lock(&fs_info->commit_root_sem); lock(btrfs-root-00); *** DEADLOCK *** 3 locks held by btrfs-cleaner/3445: #0: ffff89dbeaf28838 (&fs_info->cleaner_mutex){+.+.}-{3:3}, at: cleaner_kthread+0x6e/0x140 #1: ffff89dbeb6c7640 (sb_internal){.+.+}-{0:0}, at: start_transaction+0x40b/0x5c0 #2: ffff89dbeaf28a88 (&fs_info->commit_root_sem){++++}-{3:3}, at: btrfs_find_all_roots+0x41/0x80 stack backtrace: CPU: 0 PID: 3445 Comm: btrfs-cleaner Not tainted 5.9.0+ #101 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.13.0-2.fc32 04/01/2014 Call Trace: dump_stack+0x8b/0xb0 check_noncircular+0xcf/0xf0 __lock_acquire+0x1167/0x2150 ? __bfs+0x42/0x210 lock_acquire+0xb9/0x3d0 ? __btrfs_tree_read_lock+0x32/0x170 down_read_nested+0x43/0x130 ? __btrfs_tree_read_lock+0x32/0x170 __btrfs_tree_read_lock+0x32/0x170 __btrfs_read_lock_root_node+0x3a/0x50 btrfs_search_slot+0x614/0x9d0 ? find_held_lock+0x2b/0x80 btrfs_find_root+0x35/0x1b0 ? do_raw_spin_unlock+0x4b/0xa0 btrfs_read_tree_root+0x61/0x120 btrfs_get_root_ref+0x14b/0x600 find_parent_nodes+0x3e6/0x1b30 btrfs_find_all_roots_safe+0xb4/0x130 btrfs_find_all_roots+0x60/0x80 btrfs_qgroup_trace_extent_post+0x27/0x40 btrfs_add_delayed_data_ref+0x3fd/0x460 btrfs_free_extent+0x42/0x100 __btrfs_mod_ref+0x1d7/0x2f0 walk_up_proc+0x11c/0x400 walk_up_tree+0xf0/0x180 btrfs_drop_snapshot+0x1c7/0x780 ? btrfs_clean_one_deleted_snapshot+0x73/0x110 btrfs_clean_one_deleted_snapshot+0xfb/0x110 cleaner_kthread+0xd4/0x140 ? btrfs_alloc_root+0x50/0x50 kthread+0x13a/0x150 ? kthread_create_worker_on_cpu+0x40/0x40 ret_from_fork+0x1f/0x30 while testing another lockdep fix. This happens because we're using the commit_root_sem to protect fs_info->caching_block_groups, which creates a dependency on the groups_sem -> commit_root_sem, which is problematic because we will allocate blocks while holding tree roots. Fix this by making the list itself protected by the fs_info->block_group_cache_lock. Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: David Sterba <dsterba@suse.com> |
|
|
|
e747853cae |
btrfs: load free space cache asynchronously
While documenting the usage of the commit_root_sem, I noticed that we do not actually take the commit_root_sem in the case of the free space cache. This is problematic because we're supposed to hold that sem while we're reading the commit roots, which is what we do for the free space cache. The reason I did it inline when I originally wrote the code was because there's the case of unpinning where we need to make sure that the free space cache is loaded if we're going to use the free space cache. But we can accomplish the same thing by simply waiting for the cache to be loaded. Rework this code to load the free space cache asynchronously. This allows us to greatly cleanup the caching code because now it's all shared by the various caching methods. We also are now in a position to have the commit_root semaphore held while we're loading the free space cache. And finally our modification of ->last_byte_to_unpin is removed because it can be handled in the proper way on commit. Some care must be taken when replaying the log, when we expect that the free space cache will be read entirely before we start excluding space to replay. This could lead to overwriting space during replay. Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: David Sterba <dsterba@suse.com> |
|
|
|
4d7240f0ab |
btrfs: load the free space cache inode extents from commit root
Historically we've allowed recursive locking specifically for the free space inode. This is because we are only doing reads and know that it's safe. However we don't actually need this feature, we can get away with reading the commit root for the extents. In fact if we want to allow asynchronous loading of the free space cache we have to use the commit root, otherwise we will deadlock. Switch to using the commit root for the file extents. These are only read at load time, and are replaced as soon as we start writing the cache out to disk. The cache is never read again, so this is legitimate. This matches what we do for the inode itself, as we read that from the commit root as well. Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: David Sterba <dsterba@suse.com> |
|
|
|
cd79909bc7 |
btrfs: load free space cache into a temporary ctl
The free space cache has been special in that we would load it right away instead of farming the work off to a worker thread. This resulted in some weirdness that had to be taken into account for this fact, namely that if we every found a block group being cached the fast way we had to wait for it to finish, because we could get the cache before it had been validated and we may throw the cache away. To handle this particular case instead create a temporary btrfs_free_space_ctl to load the free space cache into. Then once we've validated that it makes sense, copy it's contents into the actual block_group->free_space_ctl. This allows us to avoid the problems of needing to wait for the caching to complete, we can clean up the discard extent handling stuff in __load_free_space_cache, and we no longer need to do the merge_space_tree() because the space is added one by one into the real free_space_ctl. This will allow further reworks of how we handle loading the free space cache. Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: David Sterba <dsterba@suse.com> |
|
|
|
66b53bae46 |
btrfs: cleanup btrfs_discard_update_discardable usage
This passes in the block_group and the free_space_ctl, but we can get this from the block group itself. Part of this is because we call it from __load_free_space_cache, which can be called for the inode cache as well. Move that call into the block group specific load section, wrap it in the right lock that we need for the assertion (but otherwise this is safe without the lock because this happens in single-thread context). Fix up the arguments to only take the block group. Add a lockdep_assert as well for good measure to make sure we don't mess up the locking again. Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: David Sterba <dsterba@suse.com> |
|
|
|
2ca08c56e8 |
btrfs: explicitly protect ->last_byte_to_unpin in unpin_extent_range
Currently unpin_extent_range happens in the transaction commit context, so we are protected from ->last_byte_to_unpin changing while we're unpinning, because any new transactions would have to wait for us to complete before modifying ->last_byte_to_unpin. However in the future we may want to change how this works, for instance with async unpinning or other such TODO items. To prepare for that future explicitly protect ->last_byte_to_unpin with the commit_root_sem so we are sure it won't change while we're doing our work. Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
|
|
|
27d56e62e4 |
btrfs: update last_byte_to_unpin in switch_commit_roots
While writing an explanation for the need of the commit_root_sem for
btrfs_prepare_extent_commit, I realized we have a slight hole that could
result in leaked space if we have to do the old style caching. Consider
the following scenario
commit root
+----+----+----+----+----+----+----+
|\\\\| |\\\\|\\\\| |\\\\|\\\\|
+----+----+----+----+----+----+----+
0 1 2 3 4 5 6 7
new commit root
+----+----+----+----+----+----+----+
| | | |\\\\| | |\\\\|
+----+----+----+----+----+----+----+
0 1 2 3 4 5 6 7
Prior to this patch, we run btrfs_prepare_extent_commit, which updates
the last_byte_to_unpin, and then we subsequently run
switch_commit_roots. In this example lets assume that
caching_ctl->progress == 1 at btrfs_prepare_extent_commit() time, which
means that cache->last_byte_to_unpin == 1. Then we go and do the
switch_commit_roots(), but in the meantime the caching thread has made
some more progress, because we drop the commit_root_sem and re-acquired
it. Now caching_ctl->progress == 3. We swap out the commit root and
carry on to unpin.
The race can happen like:
1) The caching thread was running using the old commit root when it
found the extent for [2, 3);
2) Then it released the commit_root_sem because it was in the last
item of a leaf and the semaphore was contended, and set ->progress
to 3 (value of 'last'), as the last extent item in the current leaf
was for the extent for range [2, 3);
3) Next time it gets the commit_root_sem, will start using the new
commit root and search for a key with offset 3, so it never finds
the hole for [2, 3).
So the caching thread never saw [2, 3) as free space in any of the
commit roots, and by the time finish_extent_commit() was called for
the range [0, 3), ->last_byte_to_unpin was 1, so it only returned the
subrange [0, 1) to the free space cache, skipping [2, 3).
In the unpin code we have last_byte_to_unpin == 1, so we unpin [0,1),
but do not unpin [2,3). However because caching_ctl->progress == 3 we
do not see the newly freed section of [2,3), and thus do not add it to
our free space cache. This results in us missing a chunk of free space
in memory (on disk too, unless we have a power failure before writing
the free space cache to disk).
Fix this by making sure the ->last_byte_to_unpin is set at the same time
that we swap the commit roots, this ensures that we will always be
consistent.
CC: stable@vger.kernel.org # 5.8+
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
[ update changelog with Filipe's review comments ]
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
|
|
9076dbd5ee |
btrfs: do not shorten unpin len for caching block groups
While fixing up our ->last_byte_to_unpin locking I noticed that we will shorten len based on ->last_byte_to_unpin if we're caching when we're adding back the free space. This is correct for the free space, as we cannot unpin more than ->last_byte_to_unpin, however we use len to adjust the ->bytes_pinned counters and such, which need to track the actual pinned usage. This could result in WARN_ON(space_info->bytes_pinned) triggering at unmount time. Fix this by using a local variable for the amount to add to free space cache, and leave len untouched in this case. CC: stable@vger.kernel.org # 5.4+ Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: David Sterba <dsterba@suse.com> |
|
|
|
dc51616486 |
btrfs: reorder extent buffer members for better packing
After the rwsem replaced the tree lock implementation, the extent buffer
got smaller but leaving some holes behind. By changing log_index type
and reordering, we can squeeze the size further to 240 bytes, measured on
release config on x86_64. Log_index spans only 3 values and needs to be
signed.
Before:
struct extent_buffer {
u64 start; /* 0 8 */
long unsigned int len; /* 8 8 */
long unsigned int bflags; /* 16 8 */
struct btrfs_fs_info * fs_info; /* 24 8 */
spinlock_t refs_lock; /* 32 4 */
atomic_t refs; /* 36 4 */
atomic_t io_pages; /* 40 4 */
int read_mirror; /* 44 4 */
struct callback_head callback_head __attribute__((__aligned__(8))); /* 48 16 */
/* --- cacheline 1 boundary (64 bytes) --- */
pid_t lock_owner; /* 64 4 */
bool lock_recursed; /* 68 1 */
/* XXX 3 bytes hole, try to pack */
struct rw_semaphore lock; /* 72 40 */
short int log_index; /* 112 2 */
/* XXX 6 bytes hole, try to pack */
struct page * pages[16]; /* 120 128 */
/* size: 248, cachelines: 4, members: 14 */
/* sum members: 239, holes: 2, sum holes: 9 */
/* forced alignments: 1 */
/* last cacheline: 56 bytes */
} __attribute__((__aligned__(8)));
After:
struct extent_buffer {
u64 start; /* 0 8 */
long unsigned int len; /* 8 8 */
long unsigned int bflags; /* 16 8 */
struct btrfs_fs_info * fs_info; /* 24 8 */
spinlock_t refs_lock; /* 32 4 */
atomic_t refs; /* 36 4 */
atomic_t io_pages; /* 40 4 */
int read_mirror; /* 44 4 */
struct callback_head callback_head __attribute__((__aligned__(8))); /* 48 16 */
/* --- cacheline 1 boundary (64 bytes) --- */
pid_t lock_owner; /* 64 4 */
bool lock_recursed; /* 68 1 */
s8 log_index; /* 69 1 */
/* XXX 2 bytes hole, try to pack */
struct rw_semaphore lock; /* 72 40 */
struct page * pages[16]; /* 112 128 */
/* size: 240, cachelines: 4, members: 14 */
/* sum members: 238, holes: 1, sum holes: 2 */
/* forced alignments: 1 */
/* last cacheline: 48 bytes */
} __attribute__((__aligned__(8)));
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
|
|
b9729ce014 |
btrfs: locking: rip out path->leave_spinning
We no longer distinguish between blocking and spinning, so rip out all this code. Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
|
|
|
ac5887c8e0 |
btrfs: locking: remove all the blocking helpers
Now that we're using a rw_semaphore we no longer need to indicate if a lock is blocking or not, nor do we need to flip the entire path from blocking to spinning. Remove these helpers and all the places they are called. Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
|
|
|
2ae0c2d80d |
btrfs: scrub: remove local copy of csum_size from context
The context structure unnecessarily stores copy of the checksum size, that can be now easily obtained from fs_info. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: David Sterba <dsterba@suse.com> |
|
|
|
419b791ce7 |
btrfs: check integrity: remove local copy of csum_size
The state structure unnecessarily stores copy of the checksum size, that can be now easily obtained from fs_info. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: David Sterba <dsterba@suse.com> |
|
|
|
713cebfb98 |
btrfs: remove unnecessary local variables for checksum size
Remove local variable that is then used just once and replace it with fs_info::csum_size. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: David Sterba <dsterba@suse.com> |
|
|
|
223486c27b |
btrfs: switch cached fs_info::csum_size from u16 to u32
The fs_info value is 32bit, switch also the local u16 variables. This leads to a better assembly code generated due to movzwl. This simple change will shave some bytes on x86_64 and release config: text data bss dec hex filename 1090000 17980 14912 1122892 11224c pre/btrfs.ko 1089794 17980 14912 1122686 11217e post/btrfs.ko DELTA: -206 Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: David Sterba <dsterba@suse.com> |
|
|
|
55fc29bed8 |
btrfs: use cached value of fs_info::csum_size everywhere
btrfs_get_16 shows up in the system performance profiles (helper to read 16bit values from on-disk structures). This is partially because of the checksum size that's frequently read along with data reads/writes, other u16 uses are from item size or directory entries. Replace all calls to btrfs_super_csum_size by the cached value from fs_info. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
|
|
|
fe5ecbe818 |
btrfs: precalculate checksums per leaf once
btrfs_csum_bytes_to_leaves shows up in system profiles, which makes it a candidate for optimizations. After the 64bit division has been replaced by shift, there's still a calculation done each time the function is called: checksums per leaf. As this is a constant value for the entire filesystem lifetime, we can calculate it once at mount time and reuse. This also allows to reduce the division to 64bit/32bit as we know the constant will always fit the 32bit type. Replace the open-coded rounding up with a macro that internally handles the 64bit division and as it's now a short function, make it static inline (slight code increase, slight stack usage reduction). Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: David Sterba <dsterba@suse.com> |
|
|
|
22b6331d96 |
btrfs: store precalculated csum_size in fs_info
In many places we need the checksum size and it is inefficient to read it from the raw superblock. Store the value into fs_info, actual use will be in followup patches. The size is u32 as it allows to generate better assembly than with u16. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: David Sterba <dsterba@suse.com> |
|
|
|
265fdfa6ce |
btrfs: replace s_blocksize_bits with fs_info::sectorsize_bits
The value of super_block::s_blocksize_bits is the same as fs_info::sectorsize_bits, but we don't need to do the extra dereferences in many functions and storing the bits as u32 (in fs_info) generates shorter assembly. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
|
|
|
098e63082b |
btrfs: replace div_u64 by shift in free_space_bitmap_size
Change free_space_bitmap_size to take btrfs_fs_info so we can get the sectorsize_bits to do calculations. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
|
|
|
ab108d992b |
btrfs: use precalculated sectorsize_bits from fs_info
We do a lot of calculations where we divide or multiply by sectorsize. We also know and make sure that sectorsize is a power of two, so this means all divisions can be turned to shifts and avoid eg. expensive u64/u32 divisions. The type is u32 as it's more register friendly on x86_64 compared to u8 and the resulting assembly is smaller (movzbl vs movl). There's also superblock s_blocksize_bits but it's usually one more pointer dereference farther than fs_info. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: David Sterba <dsterba@suse.com> |
|
|
|
e940e9a7c7 |
btrfs: rename page_size to io_size in submit_extent_page
The variable @page_size in submit_extent_page() is not related to page size. It can already be smaller than PAGE_SIZE, so rename it to io_size to reduce confusion, this is especially important for later subpage support. Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
|
|
|
8b8bbd461e |
btrfs: only require sector size alignment for page read
If we're reading partial page, btrfs will warn about this as read/write is always done in sector size, which now equals page size. But for the upcoming subpage read-only support, our data read is only aligned to sectorsize, which can be smaller than page size. Thus here we change the warning condition to check it against sectorsize, the behavior is not changed for regular sectorsize == PAGE_SIZE case, and won't report error for subpage read. Also, pass the proper start/end with bv_offset for check_data_csum() to handle. Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
|
|
|
12e3360f74 |
btrfs: rename pages_locked in process_pages_contig()
Function process_pages_contig() does not only handle page locking but also other operations. Rename the local variable pages_locked to pages_processed to reduce confusion. Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
|
|
|
265d4ac03f |
btrfs: sink parameter start and len to check_data_csum
For check_data_csum(), the page we're using is directly from the inode mapping, thus it has valid page_offset(). We can use (page_offset() + pg_off) to replace @start parameter completely, while the @len should always be sectorsize. Since we're here, also add some comment, as there are quite some confusion in words like start/offset, without explaining whether it's file_offset or logical bytenr. This should not affect the existing behavior, as for current sectorsize == PAGE_SIZE case, @pgoff should always be 0, and len is always PAGE_SIZE (or sectorsize from the dio read path). Reviewed-by: Goldwyn Rodrigues <rgoldwyn@suse.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
|
|
|
8896a08d8e |
btrfs: replace fs_info and private_data with inode in btrfs_wq_submit_bio
All callers of btrfs_wq_submit_bio() pass struct inode as @private_data, so there is no need for it to be (void *), replace it with "struct inode *inode". While we can extract fs_info from struct inode, also remove the @fs_info parameter. Since we're here, also replace all the (void *private_data) into (struct inode *inode). Reviewed-by: Goldwyn Rodrigues <rgoldwyn@suse.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
|
|
|
3f6bb4aeb5 |
btrfs: sink the failed_start parameter to set_extent_bit
The @failed_start parameter is only paired with @exclusive_bits, and those parameters are only used for EXTENT_LOCKED bit, which have their own wrappers lock_extent_bits(). Thus for regular set_extent_bit() calls, the failed_start makes no sense, just sink the parameter. Also, since @failed_start and @exclusive_bits are used in pairs, add an assert to make it obvious. Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
|
|
|
03509b781a |
btrfs: update the comment for find_first_extent_bit
The pitfall here is, if the parameter @bits has multiple bits set, we will return the first range which just has one of the specified bits set. This is a little tricky if we want an exact match. Anyway, update the comment to make that clear. Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
|
|
|
a3efb2f0ba |
btrfs: fix the comment on lock_extent_buffer_for_io
The return value of that function is completely wrong.
That function only returns 0 if the extent buffer doesn't need to be
submitted. The "ret = 1" and "ret = 0" are determined by the return
value of "test_and_clear_bit(EXTENT_BUFFER_DIRTY, &eb->bflags)".
And if we get ret == 1, it's because the extent buffer is dirty, and we
set its status to EXTENT_BUFFER_WRITE_BACK, and continue to page
locking.
While if we get ret == 0, it means the extent is not dirty from the
beginning, so we don't need to write it back.
The caller also follows this, in btree_write_cache_pages(), if
lock_extent_buffer_for_io() returns 0, we just skip the extent buffer
completely.
So the comment is completely wrong.
Since we're here, also change the description a little. The write bio
flushing won't be visible to the caller, thus it's not an major feature.
In the main description, only describe the locking part to make the
point more clear.
For reference, added in commit
|
|
|
|
cc7c77146e |
btrfs: remove unnecessary casts in printk
Long time ago the explicit casts were necessary for u64 but we don't need it. Remove casts where the type matches, leaving only cases that cast sector_t or loff_t. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: David Sterba <dsterba@suse.com> |
|
|
|
c842268458 |
btrfs: add set/get accessors for root_item::drop_level
The drop_level member is used directly unlike all the other int types in root_item. Add the definition and use it everywhere. The type is u8 so there's no conversion necessary and the helpers are properly inlined, this is for consistency. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: David Sterba <dsterba@suse.com> |
|
|
|
f944d2cb20 |
btrfs: use root_item helpers for limit and flags in btrfs_create_tree
For consistency use the available helpers to set flags and limit. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: David Sterba <dsterba@suse.com> |
|
|
|
3b5418fba3 |
btrfs: check-integrity: use proper helper to access btrfs_header
There's one raw use of le->cpu conversion but we have a helper to do that for us, so use it. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: David Sterba <dsterba@suse.com> |
|
|
|
09e3a28892 |
btrfs: send: use helpers to access root_item::ctransid
We have helpers to access the on-disk item members, use that for root_item::ctransid instead of raw le64_to_cpu. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: David Sterba <dsterba@suse.com> |
|
|
|
ab1405aa25 |
btrfs: generate lockdep keyset names at compile time
The names in btrfs_lockdep_keysets are generated from a simple pattern using snprintf but we can generate them directly with some macro magic and remove the helpers. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: David Sterba <dsterba@suse.com> |
|
|
|
387824afd7 |
btrfs: use the right number of levels for lockdep keysets
BTRFS_MAX_LEVEL is 8 and the keyset table is supposed to have a key for each level, but we'll never have more than 8 levels. The values passed to btrfs_set_buffer_lockdep_class are always derived from a valid extent buffer. Set the array sizes to the right value. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: David Sterba <dsterba@suse.com> |
|
|
|
ecfdc08b8c |
btrfs: remove dio iomap DSYNC workaround
This effectively reverts 09745ff88d93 ("btrfs: dio iomap DSYNC
workaround") now that the iomap API has been updated to allow
iomap_dio_complete() not to be called under i_rwsem anymore.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
|
|
a42fa64316 |
btrfs: call iomap_dio_complete() without inode_lock
If direct writes are called with O_DIRECT | O_DSYNC, it will result in a
deadlock because iomap_dio_rw() is called under i_rwsem which calls:
iomap_dio_complete()
generic_write_sync()
btrfs_sync_file()
btrfs_sync_file() requires i_rwsem, so call __iomap_dio_rw() with the
i_rwsem locked, and call iomap_dio_complete() after unlocking i_rwsem.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
|
|
502756b380 |
btrfs: remove btrfs_inode::dio_sem
The inode dio_sem can be eliminated because all DIO synchronization is now performed through inode->i_rwsem that provides the same guarantees. This reduces btrfs_inode size by 40 bytes. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
|
|
|
e9adabb971 |
btrfs: use shared lock for direct writes within EOF
Direct writes within EOF are safe to be performed with inode shared lock to improve parallelization with other direct writes or reads because EOF is not changed and there is no race with truncate(). Direct reads are already performed under shared inode lock. This patch is precursor to removing btrfs_inode->dio_sem. Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
|
|
|
c352370633 |
btrfs: push inode locking and unlocking into buffered/direct write
Push inode locking and unlocking closer to where we perform the I/O. For this we need to move the write checks inside the respective functions as well. pos is evaluated after generic_write_checks because O_APPEND can change iocb->ki_pos. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
|
|
|
a14b78ad06 |
btrfs: introduce btrfs_inode_lock()/unlock()
btrfs_inode_lock/unlock() are wrappers around inode locks, separating the type of lock and actual locking. - 0 - default, exclusive lock - BTRFS_ILOCK_SHARED - for shared locks, for possible parallel DIO - BTRFS_ILOCK_TRY - for the RWF_NOWAIT sequence The bits SHARED and TRY can be combined together. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
|
|
|
b8d8e1fd57 |
btrfs: introduce btrfs_write_check()
btrfs_write_check() checks write parameters in one place before beginning a write. This does away with inode_unlock() after every check. In the later patches, it will help push inode_lock/unlock() in buffered and direct write functions respectively. generic_write_checks needs to be called before as it could truncate iov_iter and its return used as count. Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
|
|
|
c86537a42f |
btrfs: check FS error state bit early during write
fs_info::fs_state is a filesystem bit check as opposed to inode and can be performed before we begin with write checks. This eliminates inode lock/unlock in case the error bit is set. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
|
|
|
5e8b9ef303 |
btrfs: move pos increment and pagecache extension to btrfs_buffered_write
While we do this, correct the call to pagecache_isize_extended: - pagecache_isize_extended needs to be called to the start of the write as opposed to i_size - we don't need to check range before the call, this is done in the function Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
|
|
|
4e4cabece9 |
btrfs: split btrfs_direct_IO to read and write
The read and write DIO don't have anything in common except for the call to iomap_dio_rw. Extract the write call into a new function to get rid of conditional statements for direct write. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: David Sterba <dsterba@suse.com> |
|
|
|
3d8cc17a05 |
btrfs: sysfs: add per-fs attribute for read policy
Add /sys/fs/btrfs/UUID/read_policy attribute so that the read policy for the raid1, raid1c34 and raid10 can be tuned. When this attribute is read, it will show all available policies, with active policy in [ ]. The read_policy attribute can be written using one of the items listed in there. For example: $ cat /sys/fs/btrfs/UUID/read_policy [pid] $ echo pid > /sys/fs/btrfs/UUID/read_policy Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Anand Jain <anand.jain@oracle.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
|
|
|
33fd2f714c |
btrfs: create read policy framework
As of now, we use the pid method to read striped mirrored data, which means process id determines the stripe id to read. This type of routing typically helps in a system with many small independent processes tying to read random data. On the other hand, the pid based read IO policy is inefficient because if there is a single process trying to read a large file, the overall disk bandwidth remains underutilized. So this patch introduces a read policy framework so that we could add more read policies, such as IO routing based on the device's wait-queue or manual when we have a read-preferred device or a policy based on the target storage caching. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Anand Jain <anand.jain@oracle.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
|
|
|
aaefed2078 |
btrfs: add helper for string match ignoring leading/trailing whitespace
Add a generic helper to match the string in a given buffer, and ignore the leading and trailing whitespace. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Anand Jain <anand.jain@oracle.com> Reviewed-by: David Sterba <dsterba@suse.com> [ rename variables, add comments ] Signed-off-by: David Sterba <dsterba@suse.com> |
|
|
|
88090ad36a |
btrfs: do not start and wait for delalloc on snapshot roots on transaction commit
We do not need anymore to start writeback for delalloc of roots that are being snapshotted and wait for it to complete. This was done in commit |
|
|
|
196d59ab9c |
btrfs: switch extent buffer tree lock to rw_semaphore
Historically we've implemented our own locking because we wanted to be
able to selectively spin or sleep based on what we were doing in the
tree. For instance, if all of our nodes were in cache then there's
rarely a reason to need to sleep waiting for node locks, as they'll
likely become available soon. At the time this code was written the
rw_semaphore didn't do adaptive spinning, and thus was orders of
magnitude slower than our home grown locking.
However now the opposite is the case. There are a few problems with how
we implement blocking locks, namely that we use a normal waitqueue and
simply wake everybody up in reverse sleep order. This leads to some
suboptimal performance behavior, and a lot of context switches in highly
contended cases. The rw_semaphores actually do this properly, and also
have adaptive spinning that works relatively well.
The locking code is also a bit of a bear to understand, and we lose the
benefit of lockdep for the most part because the blocking states of the
lock are simply ad-hoc and not mapped into lockdep.
So rework the locking code to drop all of this custom locking stuff, and
simply use a rw_semaphore for everything. This makes the locking much
simpler for everything, as we can now drop a lot of cruft and blocking
transitions. The performance numbers vary depending on the workload,
because generally speaking there doesn't tend to be a lot of contention
on the btree. However, on my test system which is an 80 core single
socket system with 256GiB of RAM and a 2TiB NVMe drive I get the
following results (with all debug options off):
dbench 200 baseline
Throughput 216.056 MB/sec 200 clients 200 procs max_latency=1471.197 ms
dbench 200 with patch
Throughput 737.188 MB/sec 200 clients 200 procs max_latency=714.346 ms
Previously we also used fs_mark to test this sort of contention, and
those results are far less impressive, mostly because there's not enough
tasks to really stress the locking
fs_mark -d /d[0-15] -S 0 -L 20 -n 100000 -s 0 -t 16
baseline
Average Files/sec: 160166.7
p50 Files/sec: 165832
p90 Files/sec: 123886
p99 Files/sec: 123495
real 3m26.527s
user 2m19.223s
sys 48m21.856s
patched
Average Files/sec: 164135.7
p50 Files/sec: 171095
p90 Files/sec: 122889
p99 Files/sec: 113819
real 3m29.660s
user 2m19.990s
sys 44m12.259s
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
|
|
ecdcf3c259 |
btrfs: open code insert_orphan_item
Just open code it in its sole caller and remove a level of indirection. Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
|
|
|
9037d3cbcb |
btrfs: introduce mount option rescue=all
Now that we have the building blocks for some better recovery options with corrupted file systems, add a rescue=all option to enable all of the relevant rescue options. This will allow distros to simply default to rescue=all for the "oh dear lord the world's on fire" recovery without needing to know all the different options that we have and may add in the future. Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
|
|
|
882dbe0cec |
btrfs: introduce mount option rescue=ignoredatacsums
There are cases where you can end up with bad data csums because of misbehaving applications. This happens when an application modifies a buffer in-flight when doing an O_DIRECT write. In order to recover the file we need a way to turn off data checksums so you can copy the file off, and then you can delete the file and restore it properly later. Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
|
|
|
42437a6386 |
btrfs: introduce mount option rescue=ignorebadroots
In the face of extent root corruption, or any other core fs wide root corruption we will fail to mount the file system. This makes recovery kind of a pain, because you need to fall back to userspace tools to scrape off data. Instead provide a mechanism to gracefully handle bad roots, so we can at least mount read-only and possibly recover data from the file system. Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
|
|
|
68319c18cb |
btrfs: show rescue=usebackuproot in /proc/mounts
The standalone option usebackuproot was intended as one-time use and it was not necessary to keep it in the option list. Now that we're going to have more rescue options, it's desirable to keep them intact as it could be confusing why the option disappears. Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> [ remove the btrfs_clear_opt part from open_ctree ] Signed-off-by: David Sterba <dsterba@suse.com> |
|
|
|
ab0b4a3ebf |
btrfs: add a helper to print out rescue= options
We're going to have a lot of rescue options, add a helper to collapse the /proc/mounts output to rescue=option1:option2:option3 format. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
|
|
|
ceafe3cc39 |
btrfs: sysfs: export supported rescue= mount options
We're going to be adding a variety of different rescue options, we should advertise which ones we support to make user spaces life easier in the future. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
|
|
|
334c16d82c |
btrfs: push the NODATASUM check into btrfs_lookup_bio_sums
When we move to being able to handle NULL csum_roots it'll be cleaner to just check in btrfs_lookup_bio_sums instead of at all of the caller locations, so push the NODATASUM check into it as well so it's unified. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
|
|
|
d70bf7484f |
btrfs: unify the ro checking for mount options
We're going to be adding more options that require RDONLY, so add a helper to do the check and error out if we don't have RDONLY set. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
|
|
|
a6889caf6e |
btrfs: do not start readahead for csum tree when scrubbing non-data block groups
When scrubbing a stripe of a block group we always start readahead for the checksums btree and wait for it to complete, however when the blockgroup is not a data block group (or a mixed block group) it is a waste of time to do it, since there are no checksums for metadata extents in that btree. So skip that when the block group does not have the data flag set, saving some time doing memory allocations, queueing a job in the readahead work queue, waiting for it to complete and potentially avoiding some IO as well (when csum tree extents are not in memory already). Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
|
|
|
a57ad681f1 |
btrfs: assert we are holding the reada_lock when releasing a readahead zone
When we drop the last reference of a zone, we end up releasing it through the callback reada_zone_release(), which deletes the zone from a device's reada_zones radix tree. This tree is protected by the global readahead lock at fs_info->reada_lock. Currently all places that are sure that they are dropping the last reference on a zone, are calling kref_put() in a critical section delimited by this lock, while all other places that are sure they are not dropping the last reference, do not bother calling kref_put() while holding that lock. When working on the previous fix for hangs and use-after-frees in the readahead code, my initial attempts were different and I actually ended up having reada_zone_release() called when not holding the lock, which resulted in weird and unexpected problems. So just add an assertion there to detect such problem more quickly and make the dependency more obvious. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
|
|
|
aa8c1a41a1 |
btrfs: set EXTENT_NORESERVE bits side btrfs_dirty_pages()
Set the extent bits EXTENT_NORESERVE inside btrfs_dirty_pages() as opposed to calling set_extent_bits again later. Fold check for written length within the function. Note: EXTENT_NORESERVE is set before unlocking extents. Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
|
|
|
13f0dd8f78 |
btrfs: use round_down while calculating start position in btrfs_dirty_pages()
round_down looks prettier than the bit mask operations. Reviewed-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
|
|
|
949b32732e |
btrfs: use iosize while reading compressed pages
While using compression, a submitted bio is mapped with a compressed bio which performs the read from disk, decompresses and returns uncompressed data to original bio. The original bio must reflect the uncompressed size (iosize) of the I/O to be performed, or else the page just gets the decompressed I/O length of data (disk_io_size). The compressed bio checks the extent map and gets the correct length while performing the I/O from disk. This came up in subpage work when only compressed length of the original bio was filled in the page. This worked correctly for pagesize == sectorsize because both compressed and uncompressed data are at pagesize boundaries, and would end up filling the requested page. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
|
|
|
eefa45f593 |
btrfs: calculate num_pages, reserve_bytes once in btrfs_buffered_write
write_bytes can change in btrfs_check_nocow_lock(). Calculate variables such as num_pages and reserve_bytes once we are sure of the value of write_bytes so there is no need to re-calculate. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
|
|
|
fb8a7e941b |
btrfs: calculate more accurate remaining time to sleep in transaction_kthread
If transaction_kthread is woken up before btrfs_fs_info::commit_interval seconds have elapsed it will sleep for a fixed period of 5 seconds. This is not a problem per-se but is not accurate. Instead the code should sleep for an interval which guarantees on next wakeup commit_interval would have passed. Since time tracking is not precise subtract 1 second from delta to ensure the delay we end up waiting will be longer than than the wake up period. Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
|
|
|
643900bee4 |
btrfs: record delta directly in transaction_kthread
Rename 'now' to 'delta' and store there the delta between transaction start time and current time. This is in preparation for optimising the sleep logic in the next patch. No functional changes. Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
|
|
|
e4e4288161 |
btrfs: remove redundant time check in transaction kthread loop
The value obtained from ktime_get_seconds() is guaranteed to be monotonically increasing since it's taken from CLOCK_MONOTONIC. As transaction_kthread obtains a reference to the currently running transaction under holding btrfs_fs_info::trans_lock it's guaranteed to: a) see an initialized 'cur', whose start_time is guaranteed to be smaller than 'now' or b) not obtain a 'cur' and simply go to sleep. Given this remove the unnecessary check, if it sees now < cur->start_time this would imply there are far greater problems on the machine. Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
|
|
|
ba1bc00f35 |
btrfs: use helpers to convert from seconds to jiffies in transaction_kthread
The kernel provides easy to understand helpers to convert from human understandable units to the kernel-friendly 'jiffies'. So let's use those to make the code easier to understand. No functional changes. Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
|
|
|
089c8b0551 |
btrfs: sysfs: export filesystem generation
Matching with the information that's available from the ioctl FS_INFO, add generation to the per-filesystem directory /sys/fs/btrfs/UUID/generation, which could be used by scripts. Signed-off-by: Anand Jain <anand.jain@oracle.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
|
|
|
4e7b5671c6 |
block: remove i_bdev
Switch the block device lookup interfaces to directly work with a dev_t so that struct block_device references are only acquired by the blkdev_get variants (and the blk-cgroup special case). This means that we now don't need an extra reference in the inode and can generally simplify handling of struct block_device to keep the lookups contained in the core block layer code. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Jan Kara <jack@suse.cz> Reviewed-by: Hannes Reinecke <hare@suse.de> Acked-by: Tejun Heo <tj@kernel.org> Acked-by: Coly Li <colyli@suse.de> [bcache] Signed-off-by: Jens Axboe <axboe@kernel.dk> |
|
|
|
8d65269fe8 |
block: add a bdev_kobj helper
Add a little helper to find the kobject for a struct block_device. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Reviewed-by: Jan Kara <jack@suse.cz> Reviewed-by: Hannes Reinecke <hare@suse.de> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Acked-by: Tejun Heo <tj@kernel.org> Acked-by: Coly Li <colyli@suse.de> [bcache] Acked-by: David Sterba <dsterba@suse.com> [btrfs] Signed-off-by: Jens Axboe <axboe@kernel.dk> |
|
|
|
a17a3ca55e |
for-5.10-rc5-tag
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEE8rQSAMVO+zA4DBdWxWXV+ddtWDsFAl/BGBQACgkQxWXV+ddt
WDtRYQ/+IUGjJ4l6MyL3PwLgTVabKsSpm2R3y3M/0tVJ0FIhDXYbkpjB2CkpIdcH
jUvHnRL4H59hwG6rwlXU2oC298FNbbrLGIU+c9DR50RuyQCDGnT02XvxwfDIEFzp
WLNZ/CAhRkm6boj//70lV26BpeXT59KMYwNixfCNTXq9Ir0qYHHCGg4cEBQLS++2
JUU8XVTLURIYiFOLbwmABI7V43OgDhdORIr+qnR8xjCUyhusZsjVVbvIdW3BDi/S
wK7NJsfuqgsF0zD9URJjpwTFiJL9SvBLWR8JnM9NiLW3ZbkGBL+efL6mdWuH7534
gruJRS2zYPMO2/Kpjy/31CWLap3PUSD3i8cKF+uo3liojWuSUhN8kfvcNxJVd5se
NkEK+4zOjsDIVbv7gcjThSv4KTnOUO/XfN9TWUMuduaMBmGQNaQut1FpGV98utiK
yW6x8xqcR4SI+lqY6ILqCK+qUHf19BLSsuyzZdTIontKKRA9F9hY8a4XTZuzTWml
BGYmFGP640vOo8C9GjrQfpAwa7CB/DnF/cg1AAmuZ8vrEm9zYjmauFKK8ZPcveA3
KGrnmIlYjhAIX16oRbfwOgj9D2xa1loBzJyHQHByvCMXGVFBnqRTRANRHFrdQWJB
qh9+J4EJcUXPE9WGHxAW/g9vpFkV7IRABHs7aUB8zApxI9nGA0Q=
=kcxn
-----END PGP SIGNATURE-----
Merge tag 'for-5.10-rc5-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux
Pull btrfs fixes from David Sterba:
"A few fixes for various warnings that accumulated over past two weeks:
- tree-checker: add missing return values for some errors
- lockdep fixes
- when reading qgroup config and starting quota rescan
- reverse order of quota ioctl lock and VFS freeze lock
- avoid accessing potentially stale fs info during device scan,
reported by syzbot
- add scope NOFS protection around qgroup relation changes
- check for running transaction before flushing qgroups
- fix tracking of new delalloc ranges for some cases"
* tag 'for-5.10-rc5-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
btrfs: fix lockdep splat when enabling and disabling qgroups
btrfs: do nofs allocations when adding and removing qgroup relations
btrfs: fix lockdep splat when reading qgroup config on mount
btrfs: tree-checker: add missing returns after data_ref alignment checks
btrfs: don't access possibly stale fs_info data for printing duplicate device
btrfs: tree-checker: add missing return after error in root_item
btrfs: qgroup: don't commit transaction when we already hold the handle
btrfs: fix missing delalloc new bit for new delalloc ranges
|
|
|
|
a855fbe692 |
btrfs: fix lockdep splat when enabling and disabling qgroups
When running test case btrfs/017 from fstests, lockdep reported the following splat: [ 1297.067385] ====================================================== [ 1297.067708] WARNING: possible circular locking dependency detected [ 1297.068022] 5.10.0-rc4-btrfs-next-73 #1 Not tainted [ 1297.068322] ------------------------------------------------------ [ 1297.068629] btrfs/189080 is trying to acquire lock: [ 1297.068929] ffff9f2725731690 (sb_internal#2){.+.+}-{0:0}, at: btrfs_quota_enable+0xaf/0xa70 [btrfs] [ 1297.069274] but task is already holding lock: [ 1297.069868] ffff9f2702b61a08 (&fs_info->qgroup_ioctl_lock){+.+.}-{3:3}, at: btrfs_quota_enable+0x3b/0xa70 [btrfs] [ 1297.070219] which lock already depends on the new lock. [ 1297.071131] the existing dependency chain (in reverse order) is: [ 1297.071721] -> #1 (&fs_info->qgroup_ioctl_lock){+.+.}-{3:3}: [ 1297.072375] lock_acquire+0xd8/0x490 [ 1297.072710] __mutex_lock+0xa3/0xb30 [ 1297.073061] btrfs_qgroup_inherit+0x59/0x6a0 [btrfs] [ 1297.073421] create_subvol+0x194/0x990 [btrfs] [ 1297.073780] btrfs_mksubvol+0x3fb/0x4a0 [btrfs] [ 1297.074133] __btrfs_ioctl_snap_create+0x119/0x1a0 [btrfs] [ 1297.074498] btrfs_ioctl_snap_create+0x58/0x80 [btrfs] [ 1297.074872] btrfs_ioctl+0x1a90/0x36f0 [btrfs] [ 1297.075245] __x64_sys_ioctl+0x83/0xb0 [ 1297.075617] do_syscall_64+0x33/0x80 [ 1297.075993] entry_SYSCALL_64_after_hwframe+0x44/0xa9 [ 1297.076380] -> #0 (sb_internal#2){.+.+}-{0:0}: [ 1297.077166] check_prev_add+0x91/0xc60 [ 1297.077572] __lock_acquire+0x1740/0x3110 [ 1297.077984] lock_acquire+0xd8/0x490 [ 1297.078411] start_transaction+0x3c5/0x760 [btrfs] [ 1297.078853] btrfs_quota_enable+0xaf/0xa70 [btrfs] [ 1297.079323] btrfs_ioctl+0x2c60/0x36f0 [btrfs] [ 1297.079789] __x64_sys_ioctl+0x83/0xb0 [ 1297.080232] do_syscall_64+0x33/0x80 [ 1297.080680] entry_SYSCALL_64_after_hwframe+0x44/0xa9 [ 1297.081139] other info that might help us debug this: [ 1297.082536] Possible unsafe locking scenario: [ 1297.083510] CPU0 CPU1 [ 1297.084005] ---- ---- [ 1297.084500] lock(&fs_info->qgroup_ioctl_lock); [ 1297.084994] lock(sb_internal#2); [ 1297.085485] lock(&fs_info->qgroup_ioctl_lock); [ 1297.085974] lock(sb_internal#2); [ 1297.086454] *** DEADLOCK *** [ 1297.087880] 3 locks held by btrfs/189080: [ 1297.088324] #0: ffff9f2725731470 (sb_writers#14){.+.+}-{0:0}, at: btrfs_ioctl+0xa73/0x36f0 [btrfs] [ 1297.088799] #1: ffff9f2702b60cc0 (&fs_info->subvol_sem){++++}-{3:3}, at: btrfs_ioctl+0x1f4d/0x36f0 [btrfs] [ 1297.089284] #2: ffff9f2702b61a08 (&fs_info->qgroup_ioctl_lock){+.+.}-{3:3}, at: btrfs_quota_enable+0x3b/0xa70 [btrfs] [ 1297.089771] stack backtrace: [ 1297.090662] CPU: 5 PID: 189080 Comm: btrfs Not tainted 5.10.0-rc4-btrfs-next-73 #1 [ 1297.091132] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.13.0-0-gf21b5a4aeb02-prebuilt.qemu.org 04/01/2014 [ 1297.092123] Call Trace: [ 1297.092629] dump_stack+0x8d/0xb5 [ 1297.093115] check_noncircular+0xff/0x110 [ 1297.093596] check_prev_add+0x91/0xc60 [ 1297.094076] ? kvm_clock_read+0x14/0x30 [ 1297.094553] ? kvm_sched_clock_read+0x5/0x10 [ 1297.095029] __lock_acquire+0x1740/0x3110 [ 1297.095510] lock_acquire+0xd8/0x490 [ 1297.095993] ? btrfs_quota_enable+0xaf/0xa70 [btrfs] [ 1297.096476] start_transaction+0x3c5/0x760 [btrfs] [ 1297.096962] ? btrfs_quota_enable+0xaf/0xa70 [btrfs] [ 1297.097451] btrfs_quota_enable+0xaf/0xa70 [btrfs] [ 1297.097941] ? btrfs_ioctl+0x1f4d/0x36f0 [btrfs] [ 1297.098429] btrfs_ioctl+0x2c60/0x36f0 [btrfs] [ 1297.098904] ? do_user_addr_fault+0x20c/0x430 [ 1297.099382] ? kvm_clock_read+0x14/0x30 [ 1297.099854] ? kvm_sched_clock_read+0x5/0x10 [ 1297.100328] ? sched_clock+0x5/0x10 [ 1297.100801] ? sched_clock_cpu+0x12/0x180 [ 1297.101272] ? __x64_sys_ioctl+0x83/0xb0 [ 1297.101739] __x64_sys_ioctl+0x83/0xb0 [ 1297.102207] do_syscall_64+0x33/0x80 [ 1297.102673] entry_SYSCALL_64_after_hwframe+0x44/0xa9 [ 1297.103148] RIP: 0033:0x7f773ff65d87 This is because during the quota enable ioctl we lock first the mutex qgroup_ioctl_lock and then start a transaction, and starting a transaction acquires a fs freeze semaphore (at the VFS level). However, every other code path, except for the quota disable ioctl path, we do the opposite: we start a transaction and then lock the mutex. So fix this by making the quota enable and disable paths to start the transaction without having the mutex locked, and then, after starting the transaction, lock the mutex and check if some other task already enabled or disabled the quotas, bailing with success if that was the case. Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
|
|
|
7aa6d35984 |
btrfs: do nofs allocations when adding and removing qgroup relations
When adding or removing a qgroup relation we are doing a GFP_KERNEL allocation which is not safe because we are holding a transaction handle open and that can make us deadlock if the allocator needs to recurse into the filesystem. So just surround those calls with a nofs context. Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
|
|
|
3d05cad3c3 |
btrfs: fix lockdep splat when reading qgroup config on mount
Lockdep reported the following splat when running test btrfs/190 from fstests: [ 9482.126098] ====================================================== [ 9482.126184] WARNING: possible circular locking dependency detected [ 9482.126281] 5.10.0-rc4-btrfs-next-73 #1 Not tainted [ 9482.126365] ------------------------------------------------------ [ 9482.126456] mount/24187 is trying to acquire lock: [ 9482.126534] ffffa0c869a7dac0 (&fs_info->qgroup_rescan_lock){+.+.}-{3:3}, at: qgroup_rescan_init+0x43/0xf0 [btrfs] [ 9482.126647] but task is already holding lock: [ 9482.126777] ffffa0c892ebd3a0 (btrfs-quota-00){++++}-{3:3}, at: __btrfs_tree_read_lock+0x27/0x120 [btrfs] [ 9482.126886] which lock already depends on the new lock. [ 9482.127078] the existing dependency chain (in reverse order) is: [ 9482.127213] -> #1 (btrfs-quota-00){++++}-{3:3}: [ 9482.127366] lock_acquire+0xd8/0x490 [ 9482.127436] down_read_nested+0x45/0x220 [ 9482.127528] __btrfs_tree_read_lock+0x27/0x120 [btrfs] [ 9482.127613] btrfs_read_lock_root_node+0x41/0x130 [btrfs] [ 9482.127702] btrfs_search_slot+0x514/0xc30 [btrfs] [ 9482.127788] update_qgroup_status_item+0x72/0x140 [btrfs] [ 9482.127877] btrfs_qgroup_rescan_worker+0xde/0x680 [btrfs] [ 9482.127964] btrfs_work_helper+0xf1/0x600 [btrfs] [ 9482.128039] process_one_work+0x24e/0x5e0 [ 9482.128110] worker_thread+0x50/0x3b0 [ 9482.128181] kthread+0x153/0x170 [ 9482.128256] ret_from_fork+0x22/0x30 [ 9482.128327] -> #0 (&fs_info->qgroup_rescan_lock){+.+.}-{3:3}: [ 9482.128464] check_prev_add+0x91/0xc60 [ 9482.128551] __lock_acquire+0x1740/0x3110 [ 9482.128623] lock_acquire+0xd8/0x490 [ 9482.130029] __mutex_lock+0xa3/0xb30 [ 9482.130590] qgroup_rescan_init+0x43/0xf0 [btrfs] [ 9482.131577] btrfs_read_qgroup_config+0x43a/0x550 [btrfs] [ 9482.132175] open_ctree+0x1228/0x18a0 [btrfs] [ 9482.132756] btrfs_mount_root.cold+0x13/0xed [btrfs] [ 9482.133325] legacy_get_tree+0x30/0x60 [ 9482.133866] vfs_get_tree+0x28/0xe0 [ 9482.134392] fc_mount+0xe/0x40 [ 9482.134908] vfs_kern_mount.part.0+0x71/0x90 [ 9482.135428] btrfs_mount+0x13b/0x3e0 [btrfs] [ 9482.135942] legacy_get_tree+0x30/0x60 [ 9482.136444] vfs_get_tree+0x28/0xe0 [ 9482.136949] path_mount+0x2d7/0xa70 [ 9482.137438] do_mount+0x75/0x90 [ 9482.137923] __x64_sys_mount+0x8e/0xd0 [ 9482.138400] do_syscall_64+0x33/0x80 [ 9482.138873] entry_SYSCALL_64_after_hwframe+0x44/0xa9 [ 9482.139346] other info that might help us debug this: [ 9482.140735] Possible unsafe locking scenario: [ 9482.141594] CPU0 CPU1 [ 9482.142011] ---- ---- [ 9482.142411] lock(btrfs-quota-00); [ 9482.142806] lock(&fs_info->qgroup_rescan_lock); [ 9482.143216] lock(btrfs-quota-00); [ 9482.143629] lock(&fs_info->qgroup_rescan_lock); [ 9482.144056] *** DEADLOCK *** [ 9482.145242] 2 locks held by mount/24187: [ 9482.145637] #0: ffffa0c8411c40e8 (&type->s_umount_key#44/1){+.+.}-{3:3}, at: alloc_super+0xb9/0x400 [ 9482.146061] #1: ffffa0c892ebd3a0 (btrfs-quota-00){++++}-{3:3}, at: __btrfs_tree_read_lock+0x27/0x120 [btrfs] [ 9482.146509] stack backtrace: [ 9482.147350] CPU: 1 PID: 24187 Comm: mount Not tainted 5.10.0-rc4-btrfs-next-73 #1 [ 9482.147788] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.13.0-0-gf21b5a4aeb02-prebuilt.qemu.org 04/01/2014 [ 9482.148709] Call Trace: [ 9482.149169] dump_stack+0x8d/0xb5 [ 9482.149628] check_noncircular+0xff/0x110 [ 9482.150090] check_prev_add+0x91/0xc60 [ 9482.150561] ? kvm_clock_read+0x14/0x30 [ 9482.151017] ? kvm_sched_clock_read+0x5/0x10 [ 9482.151470] __lock_acquire+0x1740/0x3110 [ 9482.151941] ? __btrfs_tree_read_lock+0x27/0x120 [btrfs] [ 9482.152402] lock_acquire+0xd8/0x490 [ 9482.152887] ? qgroup_rescan_init+0x43/0xf0 [btrfs] [ 9482.153354] __mutex_lock+0xa3/0xb30 [ 9482.153826] ? qgroup_rescan_init+0x43/0xf0 [btrfs] [ 9482.154301] ? qgroup_rescan_init+0x43/0xf0 [btrfs] [ 9482.154768] ? qgroup_rescan_init+0x43/0xf0 [btrfs] [ 9482.155226] qgroup_rescan_init+0x43/0xf0 [btrfs] [ 9482.155690] btrfs_read_qgroup_config+0x43a/0x550 [btrfs] [ 9482.156160] open_ctree+0x1228/0x18a0 [btrfs] [ 9482.156643] btrfs_mount_root.cold+0x13/0xed [btrfs] [ 9482.157108] ? rcu_read_lock_sched_held+0x5d/0x90 [ 9482.157567] ? kfree+0x31f/0x3e0 [ 9482.158030] legacy_get_tree+0x30/0x60 [ 9482.158489] vfs_get_tree+0x28/0xe0 [ 9482.158947] fc_mount+0xe/0x40 [ 9482.159403] vfs_kern_mount.part.0+0x71/0x90 [ 9482.159875] btrfs_mount+0x13b/0x3e0 [btrfs] [ 9482.160335] ? rcu_read_lock_sched_held+0x5d/0x90 [ 9482.160805] ? kfree+0x31f/0x3e0 [ 9482.161260] ? legacy_get_tree+0x30/0x60 [ 9482.161714] legacy_get_tree+0x30/0x60 [ 9482.162166] vfs_get_tree+0x28/0xe0 [ 9482.162616] path_mount+0x2d7/0xa70 [ 9482.163070] do_mount+0x75/0x90 [ 9482.163525] __x64_sys_mount+0x8e/0xd0 [ 9482.163986] do_syscall_64+0x33/0x80 [ 9482.164437] entry_SYSCALL_64_after_hwframe+0x44/0xa9 [ 9482.164902] RIP: 0033:0x7f51e907caaa This happens because at btrfs_read_qgroup_config() we can call qgroup_rescan_init() while holding a read lock on a quota btree leaf, acquired by the previous call to btrfs_search_slot_for_read(), and qgroup_rescan_init() acquires the mutex qgroup_rescan_lock. A qgroup rescan worker does the opposite: it acquires the mutex qgroup_rescan_lock, at btrfs_qgroup_rescan_worker(), and then tries to update the qgroup status item in the quota btree through the call to update_qgroup_status_item(). This inversion of locking order between the qgroup_rescan_lock mutex and quota btree locks causes the splat. Fix this simply by releasing and freeing the path before calling qgroup_rescan_init() at btrfs_read_qgroup_config(). CC: stable@vger.kernel.org # 4.4+ Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
|
|
|
6d06b0ad94 |
btrfs: tree-checker: add missing returns after data_ref alignment checks
There are sectorsize alignment checks that are reported but then
check_extent_data_ref continues. This was not intended, wrong alignment
is not a minor problem and we should return with error.
CC: stable@vger.kernel.org # 5.4+
Fixes:
|
|
|
|
0697d9a610 |
btrfs: don't access possibly stale fs_info data for printing duplicate device
Syzbot reported a possible use-after-free when printing a duplicate device warning device_list_add(). At this point it can happen that a btrfs_device::fs_info is not correctly setup yet, so we're accessing stale data, when printing the warning message using the btrfs_printk() wrappers. ================================================================== BUG: KASAN: use-after-free in btrfs_printk+0x3eb/0x435 fs/btrfs/super.c:245 Read of size 8 at addr ffff8880878e06a8 by task syz-executor225/7068 CPU: 1 PID: 7068 Comm: syz-executor225 Not tainted 5.9.0-rc5-syzkaller #0 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011 Call Trace: __dump_stack lib/dump_stack.c:77 [inline] dump_stack+0x1d6/0x29e lib/dump_stack.c:118 print_address_description+0x66/0x620 mm/kasan/report.c:383 __kasan_report mm/kasan/report.c:513 [inline] kasan_report+0x132/0x1d0 mm/kasan/report.c:530 btrfs_printk+0x3eb/0x435 fs/btrfs/super.c:245 device_list_add+0x1a88/0x1d60 fs/btrfs/volumes.c:943 btrfs_scan_one_device+0x196/0x490 fs/btrfs/volumes.c:1359 btrfs_mount_root+0x48f/0xb60 fs/btrfs/super.c:1634 legacy_get_tree+0xea/0x180 fs/fs_context.c:592 vfs_get_tree+0x88/0x270 fs/super.c:1547 fc_mount fs/namespace.c:978 [inline] vfs_kern_mount+0xc9/0x160 fs/namespace.c:1008 btrfs_mount+0x33c/0xae0 fs/btrfs/super.c:1732 legacy_get_tree+0xea/0x180 fs/fs_context.c:592 vfs_get_tree+0x88/0x270 fs/super.c:1547 do_new_mount fs/namespace.c:2875 [inline] path_mount+0x179d/0x29e0 fs/namespace.c:3192 do_mount fs/namespace.c:3205 [inline] __do_sys_mount fs/namespace.c:3413 [inline] __se_sys_mount+0x126/0x180 fs/namespace.c:3390 do_syscall_64+0x31/0x70 arch/x86/entry/common.c:46 entry_SYSCALL_64_after_hwframe+0x44/0xa9 RIP: 0033:0x44840a RSP: 002b:00007ffedfffd608 EFLAGS: 00000293 ORIG_RAX: 00000000000000a5 RAX: ffffffffffffffda RBX: 00007ffedfffd670 RCX: 000000000044840a RDX: 0000000020000000 RSI: 0000000020000100 RDI: 00007ffedfffd630 RBP: 00007ffedfffd630 R08: 00007ffedfffd670 R09: 0000000000000000 R10: 0000000000000000 R11: 0000000000000293 R12: 000000000000001a R13: 0000000000000004 R14: 0000000000000003 R15: 0000000000000003 Allocated by task 6945: kasan_save_stack mm/kasan/common.c:48 [inline] kasan_set_track mm/kasan/common.c:56 [inline] __kasan_kmalloc+0x100/0x130 mm/kasan/common.c:461 kmalloc_node include/linux/slab.h:577 [inline] kvmalloc_node+0x81/0x110 mm/util.c:574 kvmalloc include/linux/mm.h:757 [inline] kvzalloc include/linux/mm.h:765 [inline] btrfs_mount_root+0xd0/0xb60 fs/btrfs/super.c:1613 legacy_get_tree+0xea/0x180 fs/fs_context.c:592 vfs_get_tree+0x88/0x270 fs/super.c:1547 fc_mount fs/namespace.c:978 [inline] vfs_kern_mount+0xc9/0x160 fs/namespace.c:1008 btrfs_mount+0x33c/0xae0 fs/btrfs/super.c:1732 legacy_get_tree+0xea/0x180 fs/fs_context.c:592 vfs_get_tree+0x88/0x270 fs/super.c:1547 do_new_mount fs/namespace.c:2875 [inline] path_mount+0x179d/0x29e0 fs/namespace.c:3192 do_mount fs/namespace.c:3205 [inline] __do_sys_mount fs/namespace.c:3413 [inline] __se_sys_mount+0x126/0x180 fs/namespace.c:3390 do_syscall_64+0x31/0x70 arch/x86/entry/common.c:46 entry_SYSCALL_64_after_hwframe+0x44/0xa9 Freed by task 6945: kasan_save_stack mm/kasan/common.c:48 [inline] kasan_set_track+0x3d/0x70 mm/kasan/common.c:56 kasan_set_free_info+0x17/0x30 mm/kasan/generic.c:355 __kasan_slab_free+0xdd/0x110 mm/kasan/common.c:422 __cache_free mm/slab.c:3418 [inline] kfree+0x113/0x200 mm/slab.c:3756 deactivate_locked_super+0xa7/0xf0 fs/super.c:335 btrfs_mount_root+0x72b/0xb60 fs/btrfs/super.c:1678 legacy_get_tree+0xea/0x180 fs/fs_context.c:592 vfs_get_tree+0x88/0x270 fs/super.c:1547 fc_mount fs/namespace.c:978 [inline] vfs_kern_mount+0xc9/0x160 fs/namespace.c:1008 btrfs_mount+0x33c/0xae0 fs/btrfs/super.c:1732 legacy_get_tree+0xea/0x180 fs/fs_context.c:592 vfs_get_tree+0x88/0x270 fs/super.c:1547 do_new_mount fs/namespace.c:2875 [inline] path_mount+0x179d/0x29e0 fs/namespace.c:3192 do_mount fs/namespace.c:3205 [inline] __do_sys_mount fs/namespace.c:3413 [inline] __se_sys_mount+0x126/0x180 fs/namespace.c:3390 do_syscall_64+0x31/0x70 arch/x86/entry/common.c:46 entry_SYSCALL_64_after_hwframe+0x44/0xa9 The buggy address belongs to the object at ffff8880878e0000 which belongs to the cache kmalloc-16k of size 16384 The buggy address is located 1704 bytes inside of 16384-byte region [ffff8880878e0000, ffff8880878e4000) The buggy address belongs to the page: page:0000000060704f30 refcount:1 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0x878e0 head:0000000060704f30 order:3 compound_mapcount:0 compound_pincount:0 flags: 0xfffe0000010200(slab|head) raw: 00fffe0000010200 ffffea00028e9a08 ffffea00021e3608 ffff8880aa440b00 raw: 0000000000000000 ffff8880878e0000 0000000100000001 0000000000000000 page dumped because: kasan: bad access detected Memory state around the buggy address: ffff8880878e0580: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb ffff8880878e0600: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb >ffff8880878e0680: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb ^ ffff8880878e0700: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb ffff8880878e0780: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb ================================================================== The syzkaller reproducer for this use-after-free crafts a filesystem image and loop mounts it twice in a loop. The mount will fail as the crafted image has an invalid chunk tree. When this happens btrfs_mount_root() will call deactivate_locked_super(), which then cleans up fs_info and fs_info::sb. If a second thread now adds the same block-device to the filesystem, it will get detected as a duplicate device and device_list_add() will reject the duplicate and print a warning. But as the fs_info pointer passed in is non-NULL this will result in a use-after-free. Instead of printing possibly uninitialized or already freed memory in btrfs_printk(), explicitly pass in a NULL fs_info so the printing of the device name will be skipped altogether. There was a slightly different approach discussed in https://lore.kernel.org/linux-btrfs/20200114060920.4527-1-anand.jain@oracle.com/t/#u Link: https://lore.kernel.org/linux-btrfs/000000000000c9e14b05afcc41ba@google.com Reported-by: syzbot+582e66e5edf36a22c7b0@syzkaller.appspotmail.com CC: stable@vger.kernel.org # 4.19+ Reviewed-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
|
|
|
1a49a97df6 |
btrfs: tree-checker: add missing return after error in root_item
There's a missing return statement after an error is found in the
root_item, this can cause further problems when a crafted image triggers
the error.
Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=210181
Fixes:
|
|
|
|
6f23277a49 |
btrfs: qgroup: don't commit transaction when we already hold the handle
[BUG]
When running the following script, btrfs will trigger an ASSERT():
#/bin/bash
mkfs.btrfs -f $dev
mount $dev $mnt
xfs_io -f -c "pwrite 0 1G" $mnt/file
sync
btrfs quota enable $mnt
btrfs quota rescan -w $mnt
# Manually set the limit below current usage
btrfs qgroup limit 512M $mnt $mnt
# Crash happens
touch $mnt/file
The dmesg looks like this:
assertion failed: refcount_read(&trans->use_count) == 1, in fs/btrfs/transaction.c:2022
------------[ cut here ]------------
kernel BUG at fs/btrfs/ctree.h:3230!
invalid opcode: 0000 [#1] SMP PTI
RIP: 0010:assertfail.constprop.0+0x18/0x1a [btrfs]
btrfs_commit_transaction.cold+0x11/0x5d [btrfs]
try_flush_qgroup+0x67/0x100 [btrfs]
__btrfs_qgroup_reserve_meta+0x3a/0x60 [btrfs]
btrfs_delayed_update_inode+0xaa/0x350 [btrfs]
btrfs_update_inode+0x9d/0x110 [btrfs]
btrfs_dirty_inode+0x5d/0xd0 [btrfs]
touch_atime+0xb5/0x100
iterate_dir+0xf1/0x1b0
__x64_sys_getdents64+0x78/0x110
do_syscall_64+0x33/0x80
entry_SYSCALL_64_after_hwframe+0x44/0xa9
RIP: 0033:0x7fb5afe588db
[CAUSE]
In try_flush_qgroup(), we assume we don't hold a transaction handle at
all. This is true for data reservation and mostly true for metadata.
Since data space reservation always happens before we start a
transaction, and for most metadata operation we reserve space in
start_transaction().
But there is an exception, btrfs_delayed_inode_reserve_metadata().
It holds a transaction handle, while still trying to reserve extra
metadata space.
When we hit EDQUOT inside btrfs_delayed_inode_reserve_metadata(), we
will join current transaction and commit, while we still have
transaction handle from qgroup code.
[FIX]
Let's check current->journal before we join the transaction.
If current->journal is unset or BTRFS_SEND_TRANS_STUB, it means
we are not holding a transaction, thus are able to join and then commit
transaction.
If current->journal is a valid transaction handle, we avoid committing
transaction and just end it
This is less effective than committing current transaction, as it won't
free metadata reserved space, but we may still free some data space
before new data writes.
Bugzilla: https://bugzilla.suse.com/show_bug.cgi?id=1178634
Fixes:
|
|
|
|
c334730988 |
btrfs: fix missing delalloc new bit for new delalloc ranges
When doing a buffered write, through one of the write family syscalls, we look for ranges which currently don't have allocated extents and set the 'delalloc new' bit on them, so that we can report a correct number of used blocks to the stat(2) syscall until delalloc is flushed and ordered extents complete. However there are a few other places where we can do a buffered write against a range that is mapped to a hole (no extent allocated) and where we do not set the 'new delalloc' bit. Those places are: - Doing a memory mapped write against a hole; - Cloning an inline extent into a hole starting at file offset 0; - Calling btrfs_cont_expand() when the i_size of the file is not aligned to the sector size and is located in a hole. For example when cloning to a destination offset beyond EOF. So after such cases, until the corresponding delalloc range is flushed and the respective ordered extents complete, we can report an incorrect number of blocks used through the stat(2) syscall. In some cases we can end up reporting 0 used blocks to stat(2), which is a particular bad value to report as it may mislead tools to think a file is completely sparse when its i_size is not zero, making them skip reading any data, an undesired consequence for tools such as archivers and other backup tools, as reported a long time ago in the following thread (and other past threads): https://lists.gnu.org/archive/html/bug-tar/2016-07/msg00001.html Example reproducer: $ cat reproducer.sh #!/bin/bash MNT=/mnt/sdi DEV=/dev/sdi mkfs.btrfs -f $DEV > /dev/null # mkfs.xfs -f $DEV > /dev/null # mkfs.ext4 -F $DEV > /dev/null # mkfs.f2fs -f $DEV > /dev/null mount $DEV $MNT xfs_io -f -c "truncate 64K" \ -c "mmap -w 0 64K" \ -c "mwrite -S 0xab 0 64K" \ -c "munmap" \ $MNT/foo blocks_used=$(stat -c %b $MNT/foo) echo "blocks used: $blocks_used" if [ $blocks_used -eq 0 ]; then echo "ERROR: blocks used is 0" fi umount $DEV $ ./reproducer.sh blocks used: 0 ERROR: blocks used is 0 So move the logic that decides to set the 'delalloc bit' bit into the function btrfs_set_extent_delalloc(), since that is what we use for all those missing cases as well as for the cases that currently work well. This change is also preparatory work for an upcoming patch that fixes other problems related to tracking and reporting the number of bytes used by an inode. CC: stable@vger.kernel.org # 4.19+ Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
|
|
|
e2f0c565ec |
for-5.10-rc3-tag
-----BEGIN PGP SIGNATURE----- iQIzBAABCgAdFiEE8rQSAMVO+zA4DBdWxWXV+ddtWDsFAl+qtyAACgkQxWXV+ddt WDusYg/+P1SroRe2n33Pi6v9w47Luqjfi5qMfdrV/ex7g9bP/7dVPvAa0YJr7CpA UHpvcgWMFK2e29oOoeEoYXukHQ4BKtC6F5L0MPgVJocDT6xsAWM3v98VZn69Olcu TcUNkxUVbj7OBbQDaINV8dXnTbQWNOsfNlYXH+nPgWqrjSDbPoLkXIEVfZ9CTBeZ P/qEshkTXqvx1Pux/uRcKrMSu+lSFICKlLvky0b9gRpg4usVlF8jlGQrvJHQvqnP lECR1cb7/nf2PQ+HdPpgigD24bddiiORoyGW68Q1zZHgs+kGfL6p4M3WF04WINrV Taiv7WVZ6qHiEB+LDxlOx2cy0Z6YzFOaGSASz+Hh64hvOezBOVGvCF1U9U/Dp+MC n6QjUiw6c0rIjbdoxpTfQETCdIt/l3qXfOVEr9Zjr2KEbasLsZXdGSf3ydr81Uff 94CwrXp2wq429zu4mdCfOwihF/288+VrN8XRfkSy5RFQ5hHVnZBFQO4KbRIQ4i5X ZIjHQPX0jA/XN/jpUde/RJL5AyLz20n0o9I3frjXwSs+rvU3f0wD//fxmXlRUshM hsFXFKO0VdaFtoywVIf7VK/fDsKQhiq+9Yg48A8ylpk+W7meMjeDYuYMIEhMQX3m S1OMG1Qf27pWXD6KEzpaqzI4SrYBOGVDsX8qxMxRws7n55koVJc= =3FTR -----END PGP SIGNATURE----- Merge tag 'for-5.10-rc3-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux Pull btrfs fixes from David Sterba: "A handful of minor fixes and updates: - handle missing device replace item on mount (syzbot report) - fix space reservation calculation when finishing relocation - fix memory leak on error path in ref-verify (debugging feature) - fix potential overflow during defrag on 32bit arches - minor code update to silence smatch warning - minor error message updates" * tag 'for-5.10-rc3-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux: btrfs: ref-verify: fix memory leak in btrfs_ref_tree_mod btrfs: dev-replace: fail mount if we don't have replace item with target device btrfs: scrub: update message regarding read-only status btrfs: clean up NULL checks in qgroup_unreserve_range() btrfs: fix min reserved size calculation in merge_reloc_root btrfs: print the block rsv type when we fail our reservation btrfs: fix potential overflow in cluster_pages_for_defrag on 32bit arch |
|
|
|
e8f147dc3f |
fs: Remove asm/kmap_types.h includes
Historical leftovers from the time where kmap() had fixed slots. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Acked-by: David Sterba <dsterba@suse.com> Cc: Andrew Morton <akpm@linux-foundation.org> Link: https://lore.kernel.org/r/20201103095856.870272797@linutronix.de |
|
|
|
468600c6ec |
btrfs: ref-verify: fix memory leak in btrfs_ref_tree_mod
There is one error handling path that does not free ref, which may cause a minor memory leak. CC: stable@vger.kernel.org # 4.19+ Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Dinghao Liu <dinghao.liu@zju.edu.cn> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
|
|
|
cf89af146b |
btrfs: dev-replace: fail mount if we don't have replace item with target device
If there is a device BTRFS_DEV_REPLACE_DEVID without the device replace item, then it means the filesystem is inconsistent state. This is either corruption or a crafted image. Fail the mount as this needs a closer look what is actually wrong. As of now if BTRFS_DEV_REPLACE_DEVID is present without the replace item, in __btrfs_free_extra_devids() we determine that there is an extra device, and free those extra devices but continue to mount the device. However, we were wrong in keeping tack of the rw_devices so the syzbot testcase failed: WARNING: CPU: 1 PID: 3612 at fs/btrfs/volumes.c:1166 close_fs_devices.part.0+0x607/0x800 fs/btrfs/volumes.c:1166 Kernel panic - not syncing: panic_on_warn set ... CPU: 1 PID: 3612 Comm: syz-executor.2 Not tainted 5.9.0-rc4-syzkaller #0 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011 Call Trace: __dump_stack lib/dump_stack.c:77 [inline] dump_stack+0x198/0x1fd lib/dump_stack.c:118 panic+0x347/0x7c0 kernel/panic.c:231 __warn.cold+0x20/0x46 kernel/panic.c:600 report_bug+0x1bd/0x210 lib/bug.c:198 handle_bug+0x38/0x90 arch/x86/kernel/traps.c:234 exc_invalid_op+0x14/0x40 arch/x86/kernel/traps.c:254 asm_exc_invalid_op+0x12/0x20 arch/x86/include/asm/idtentry.h:536 RIP: 0010:close_fs_devices.part.0+0x607/0x800 fs/btrfs/volumes.c:1166 RSP: 0018:ffffc900091777e0 EFLAGS: 00010246 RAX: 0000000000040000 RBX: ffffffffffffffff RCX: ffffc9000c8b7000 RDX: 0000000000040000 RSI: ffffffff83097f47 RDI: 0000000000000007 RBP: dffffc0000000000 R08: 0000000000000001 R09: ffff8880988a187f R10: 0000000000000000 R11: 0000000000000001 R12: ffff88809593a130 R13: ffff88809593a1ec R14: ffff8880988a1908 R15: ffff88809593a050 close_fs_devices fs/btrfs/volumes.c:1193 [inline] btrfs_close_devices+0x95/0x1f0 fs/btrfs/volumes.c:1179 open_ctree+0x4984/0x4a2d fs/btrfs/disk-io.c:3434 btrfs_fill_super fs/btrfs/super.c:1316 [inline] btrfs_mount_root.cold+0x14/0x165 fs/btrfs/super.c:1672 The fix here is, when we determine that there isn't a replace item then fail the mount if there is a replace target device (devid 0). CC: stable@vger.kernel.org # 4.19+ Reported-by: syzbot+4cfe71a4da060be47502@syzkaller.appspotmail.com Signed-off-by: Anand Jain <anand.jain@oracle.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
|
|
|
a4852cf268 |
btrfs: scrub: update message regarding read-only status
Based on user feedback update the message printed when scrub fails to start due to write requirements. To make a distinction add a device id to the messages. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: David Sterba <dsterba@suse.com> |
|
|
|
f07728d541 |
btrfs: clean up NULL checks in qgroup_unreserve_range()
Smatch complains that this code dereferences "entry" before checking whether it's NULL on the next line. Fortunately, rb_entry() will never return NULL so it doesn't cause a problem. We can clean up the NULL checking a bit to silence the warning and make the code more clear. Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
|
|
|
fca3a45d08 |
btrfs: fix min reserved size calculation in merge_reloc_root
The minimum reserve size was adjusted to take into account the height of
the tree we are merging, however we can have a root with a level == 0.
What we want is root_level + 1 to get the number of nodes we may have to
cow. This fixes the enospc_debug warning pops with btrfs/101.
Nikolay: this fixes failures on btrfs/060 btrfs/062 btrfs/063 and
btrfs/195 That I was seeing, the call trace was:
[ 3680.515564] ------------[ cut here ]------------
[ 3680.515566] BTRFS: block rsv returned -28
[ 3680.515585] WARNING: CPU: 2 PID: 8339 at fs/btrfs/block-rsv.c:521 btrfs_use_block_rsv+0x162/0x180
[ 3680.515587] Modules linked in:
[ 3680.515591] CPU: 2 PID: 8339 Comm: btrfs Tainted: G W 5.9.0-rc8-default #95
[ 3680.515593] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.13.0-1ubuntu1 04/01/2014
[ 3680.515595] RIP: 0010:btrfs_use_block_rsv+0x162/0x180
[ 3680.515600] RSP: 0018:ffffa01ac9753910 EFLAGS: 00010282
[ 3680.515602] RAX: 0000000000000000 RBX: ffff984b34200000 RCX: 0000000000000027
[ 3680.515604] RDX: 0000000000000027 RSI: 0000000000000000 RDI: ffff984b3bd19e28
[ 3680.515606] RBP: 0000000000004000 R08: ffff984b3bd19e20 R09: 0000000000000001
[ 3680.515608] R10: 0000000000000004 R11: 0000000000000046 R12: ffff984b264fdc00
[ 3680.515609] R13: ffff984b13149000 R14: 00000000ffffffe4 R15: ffff984b34200000
[ 3680.515613] FS: 00007f4e2912b8c0(0000) GS:ffff984b3bd00000(0000) knlGS:0000000000000000
[ 3680.515615] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 3680.515617] CR2: 00007fab87122150 CR3: 0000000118e42000 CR4: 00000000000006e0
[ 3680.515620] Call Trace:
[ 3680.515627] btrfs_alloc_tree_block+0x8b/0x340
[ 3680.515633] ? __lock_acquire+0x51a/0xac0
[ 3680.515646] alloc_tree_block_no_bg_flush+0x4f/0x60
[ 3680.515651] __btrfs_cow_block+0x14e/0x7e0
[ 3680.515662] btrfs_cow_block+0x144/0x2c0
[ 3680.515670] merge_reloc_root+0x4d4/0x610
[ 3680.515675] ? btrfs_lookup_fs_root+0x78/0x90
[ 3680.515686] merge_reloc_roots+0xee/0x280
[ 3680.515695] relocate_block_group+0x2ce/0x5e0
[ 3680.515704] btrfs_relocate_block_group+0x16e/0x310
[ 3680.515711] btrfs_relocate_chunk+0x38/0xf0
[ 3680.515716] btrfs_shrink_device+0x200/0x560
[ 3680.515728] btrfs_rm_device+0x1ae/0x6a6
[ 3680.515744] ? _copy_from_user+0x6e/0xb0
[ 3680.515750] btrfs_ioctl+0x1afe/0x28c0
[ 3680.515755] ? find_held_lock+0x2b/0x80
[ 3680.515760] ? do_user_addr_fault+0x1f8/0x418
[ 3680.515773] ? __x64_sys_ioctl+0x77/0xb0
[ 3680.515775] __x64_sys_ioctl+0x77/0xb0
[ 3680.515781] do_syscall_64+0x31/0x70
[ 3680.515785] entry_SYSCALL_64_after_hwframe+0x44/0xa9
Reported-by: Nikolay Borisov <nborisov@suse.com>
Fixes:
|
|
|
|
e38fdb7167 |
btrfs: print the block rsv type when we fail our reservation
To help with debugging, print the type of the block rsv when we fail to use our target block rsv in btrfs_use_block_rsv. This now produces: [ 544.672035] BTRFS: block rsv 1 returned -28 which is still cryptic without consulting the enum in block-rsv.h but I guess it's better than nothing. Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> [ add note from Nikolay ] Signed-off-by: David Sterba <dsterba@suse.com> |
|
|
|
a1fbc6750e |
btrfs: fix potential overflow in cluster_pages_for_defrag on 32bit arch
On 32-bit systems, this shift will overflow for files larger than 4GB as
start_index is unsigned long while the calls to btrfs_delalloc_*_space
expect u64.
CC: stable@vger.kernel.org # 4.4+
Fixes:
|
|
|
|
f5d808567a |
for-5.10-rc1-tag
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEE8rQSAMVO+zA4DBdWxWXV+ddtWDsFAl+cNwcACgkQxWXV+ddt
WDth+g//esuxzGaUenIuMgnT8ofnte4I9Kst8ShBAl1Asglq1p1WBJfYtHbMMqeE
CmJ32hGs6JBoaB/Wdta41F840BxRbaOCyo10oB6jx5QV2rCh99PvPhwmsmaUGQHG
02umRPRJcILReB53LwJvLQTQMVfuUqfBV2TyyLHrY8R8pIGsG61p1d3cgg+NWtKQ
c3RC2eH/uIeQTDaZX0ZOpa6TBPOs+MNDiF5d3UxvpiBXWum3yijdXEfhpfiOom4A
eCH+lj+iQQ4EtoKjXi0q7ziU1eAKWkQ3A4rMo9fr7iQkQIVkvZc2d9WALsF+znXi
f2ochi3msemX19I5g0RQ1s5XExCKBSbr6v934BDlpAZ8Pc4IFz9rkLZwlMYp/SVJ
9aYZOG9Rm0P3DaiYPvKZBcxsTRvxXlXlVWfMCGUB4KKaGyawJgSZxnegxP8nWb2C
+VrvVw2NJsoioQTX+2OqUc2FCuDib7ehjH80q9IXLlozfoKA4Lfj1G6qCUEsuffI
NbW5Ndkaza/qw3mOTEn/sU9+kzr1P8CVtSFWcI/GJqp/kisTAYtyfU/GWD9JLi8s
uaHGAZXdCEVNJ2opgnLiW0ZPNMm4oSernn8JckhsUUGUJwJ4c1pGEHSzIHEF+d9E
HBmjQN6qcW4DDUb1rSQEBsFoKXeebp7P6usNNrUqau1wWKBworA=
=bR/o
-----END PGP SIGNATURE-----
Merge tag 'for-5.10-rc1-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux
Pull btrfs fixes from David Sterba:
- lockdep fixes:
- drop path locks before manipulating sysfs objects or qgroups
- preliminary fixes before tree locks get switched to rwsem
- use annotated seqlock
- build warning fixes (printk format)
- fix relocation vs fallocate race
- tree checker properly validates number of stripes and parity
- readahead vs device replace fixes
- iomap dio fix for unnecessary buffered io fallback
* tag 'for-5.10-rc1-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
btrfs: convert data_seqcount to seqcount_mutex_t
btrfs: don't fallback to buffered read if we don't need to
btrfs: add a helper to read the tree_root commit root for backref lookup
btrfs: drop the path before adding qgroup items when enabling qgroups
btrfs: fix readahead hang and use-after-free after removing a device
btrfs: fix use-after-free on readahead extent after failure to create it
btrfs: tree-checker: validate number of chunk stripes and parity
btrfs: tree-checker: fix incorrect printk format
btrfs: drop the path before adding block group sysfs files
btrfs: fix relocation failure due to race with fallocate
|
|
|
|
d5c8238849 |
btrfs: convert data_seqcount to seqcount_mutex_t
By doing so we can associate the sequence counter to the chunk_mutex for lockdep purposes (compiled-out otherwise), the mutex is otherwise used on the write side. Also avoid explicitly disabling preemption around the write region as it will now be done automatically by the seqcount machinery based on the lock type. Signed-off-by: Davidlohr Bueso <dbueso@suse.de> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
|
|
|
0425e7badb |
btrfs: don't fallback to buffered read if we don't need to
Since we switched to the iomap infrastructure in b5ff9f1a96e8f ("btrfs:
switch to iomap for direct IO") we're calling generic_file_buffered_read()
directly and not via generic_file_read_iter() anymore.
If the read could read everything there is no need to bother calling
generic_file_buffered_read(), like it is handled in
generic_file_read_iter().
If we call generic_file_buffered_read() in this case we can hit a
situation where we do an invalid readahead and cause this UBSAN splat
in fstest generic/091:
run fstests generic/091 at 2020-10-21 10:52:32
================================================================================
UBSAN: shift-out-of-bounds in ./include/linux/log2.h:57:13
shift exponent 64 is too large for 64-bit type 'long unsigned int'
CPU: 0 PID: 656 Comm: fsx Not tainted 5.9.0-rc7+ #821
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.13.0-0-gf21b5a4-rebuilt.opensuse.org 04/01/2014
Call Trace:
__dump_stack lib/dump_stack.c:77
dump_stack+0x57/0x70 lib/dump_stack.c:118
ubsan_epilogue+0x5/0x40 lib/ubsan.c:148
__ubsan_handle_shift_out_of_bounds.cold+0x61/0xe9 lib/ubsan.c:395
__roundup_pow_of_two ./include/linux/log2.h:57
get_init_ra_size mm/readahead.c:318
ondemand_readahead.cold+0x16/0x2c mm/readahead.c:530
generic_file_buffered_read+0x3ac/0x840 mm/filemap.c:2199
call_read_iter ./include/linux/fs.h:1876
new_sync_read+0x102/0x180 fs/read_write.c:415
vfs_read+0x11c/0x1a0 fs/read_write.c:481
ksys_read+0x4f/0xc0 fs/read_write.c:615
do_syscall_64+0x33/0x40 arch/x86/entry/common.c:46
entry_SYSCALL_64_after_hwframe+0x44/0xa9 arch/x86/entry/entry_64.S:118
RIP: 0033:0x7fe87fee992e
RSP: 002b:00007ffe01605278 EFLAGS: 00000246 ORIG_RAX: 0000000000000000
RAX: ffffffffffffffda RBX: 000000000004f000 RCX: 00007fe87fee992e
RDX: 0000000000004000 RSI: 0000000001677000 RDI: 0000000000000003
RBP: 000000000004f000 R08: 0000000000004000 R09: 000000000004f000
R10: 0000000000053000 R11: 0000000000000246 R12: 0000000000004000
R13: 0000000000000000 R14: 000000000007a120 R15: 0000000000000000
================================================================================
BTRFS info (device nullb0): has skinny extents
BTRFS info (device nullb0): ZONED mode enabled, zone size 268435456 B
BTRFS info (device nullb0): enabling ssd optimizations
Fixes:
|
|
|
|
49d11bead7 |
btrfs: add a helper to read the tree_root commit root for backref lookup
I got the following lockdep splat with tree locks converted to rwsem patches on btrfs/104: ====================================================== WARNING: possible circular locking dependency detected 5.9.0+ #102 Not tainted ------------------------------------------------------ btrfs-cleaner/903 is trying to acquire lock: ffff8e7fab6ffe30 (btrfs-root-00){++++}-{3:3}, at: __btrfs_tree_read_lock+0x32/0x170 but task is already holding lock: ffff8e7fab628a88 (&fs_info->commit_root_sem){++++}-{3:3}, at: btrfs_find_all_roots+0x41/0x80 which lock already depends on the new lock. the existing dependency chain (in reverse order) is: -> #3 (&fs_info->commit_root_sem){++++}-{3:3}: down_read+0x40/0x130 caching_thread+0x53/0x5a0 btrfs_work_helper+0xfa/0x520 process_one_work+0x238/0x540 worker_thread+0x55/0x3c0 kthread+0x13a/0x150 ret_from_fork+0x1f/0x30 -> #2 (&caching_ctl->mutex){+.+.}-{3:3}: __mutex_lock+0x7e/0x7b0 btrfs_cache_block_group+0x1e0/0x510 find_free_extent+0xb6e/0x12f0 btrfs_reserve_extent+0xb3/0x1b0 btrfs_alloc_tree_block+0xb1/0x330 alloc_tree_block_no_bg_flush+0x4f/0x60 __btrfs_cow_block+0x11d/0x580 btrfs_cow_block+0x10c/0x220 commit_cowonly_roots+0x47/0x2e0 btrfs_commit_transaction+0x595/0xbd0 sync_filesystem+0x74/0x90 generic_shutdown_super+0x22/0x100 kill_anon_super+0x14/0x30 btrfs_kill_super+0x12/0x20 deactivate_locked_super+0x36/0xa0 cleanup_mnt+0x12d/0x190 task_work_run+0x5c/0xa0 exit_to_user_mode_prepare+0x1df/0x200 syscall_exit_to_user_mode+0x54/0x280 entry_SYSCALL_64_after_hwframe+0x44/0xa9 -> #1 (&space_info->groups_sem){++++}-{3:3}: down_read+0x40/0x130 find_free_extent+0x2ed/0x12f0 btrfs_reserve_extent+0xb3/0x1b0 btrfs_alloc_tree_block+0xb1/0x330 alloc_tree_block_no_bg_flush+0x4f/0x60 __btrfs_cow_block+0x11d/0x580 btrfs_cow_block+0x10c/0x220 commit_cowonly_roots+0x47/0x2e0 btrfs_commit_transaction+0x595/0xbd0 sync_filesystem+0x74/0x90 generic_shutdown_super+0x22/0x100 kill_anon_super+0x14/0x30 btrfs_kill_super+0x12/0x20 deactivate_locked_super+0x36/0xa0 cleanup_mnt+0x12d/0x190 task_work_run+0x5c/0xa0 exit_to_user_mode_prepare+0x1df/0x200 syscall_exit_to_user_mode+0x54/0x280 entry_SYSCALL_64_after_hwframe+0x44/0xa9 -> #0 (btrfs-root-00){++++}-{3:3}: __lock_acquire+0x1167/0x2150 lock_acquire+0xb9/0x3d0 down_read_nested+0x43/0x130 __btrfs_tree_read_lock+0x32/0x170 __btrfs_read_lock_root_node+0x3a/0x50 btrfs_search_slot+0x614/0x9d0 btrfs_find_root+0x35/0x1b0 btrfs_read_tree_root+0x61/0x120 btrfs_get_root_ref+0x14b/0x600 find_parent_nodes+0x3e6/0x1b30 btrfs_find_all_roots_safe+0xb4/0x130 btrfs_find_all_roots+0x60/0x80 btrfs_qgroup_trace_extent_post+0x27/0x40 btrfs_add_delayed_data_ref+0x3fd/0x460 btrfs_free_extent+0x42/0x100 __btrfs_mod_ref+0x1d7/0x2f0 walk_up_proc+0x11c/0x400 walk_up_tree+0xf0/0x180 btrfs_drop_snapshot+0x1c7/0x780 btrfs_clean_one_deleted_snapshot+0xfb/0x110 cleaner_kthread+0xd4/0x140 kthread+0x13a/0x150 ret_from_fork+0x1f/0x30 other info that might help us debug this: Chain exists of: btrfs-root-00 --> &caching_ctl->mutex --> &fs_info->commit_root_sem Possible unsafe locking scenario: CPU0 CPU1 ---- ---- lock(&fs_info->commit_root_sem); lock(&caching_ctl->mutex); lock(&fs_info->commit_root_sem); lock(btrfs-root-00); *** DEADLOCK *** 3 locks held by btrfs-cleaner/903: #0: ffff8e7fab628838 (&fs_info->cleaner_mutex){+.+.}-{3:3}, at: cleaner_kthread+0x6e/0x140 #1: ffff8e7faadac640 (sb_internal){.+.+}-{0:0}, at: start_transaction+0x40b/0x5c0 #2: ffff8e7fab628a88 (&fs_info->commit_root_sem){++++}-{3:3}, at: btrfs_find_all_roots+0x41/0x80 stack backtrace: CPU: 0 PID: 903 Comm: btrfs-cleaner Not tainted 5.9.0+ #102 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.13.0-2.fc32 04/01/2014 Call Trace: dump_stack+0x8b/0xb0 check_noncircular+0xcf/0xf0 __lock_acquire+0x1167/0x2150 ? __bfs+0x42/0x210 lock_acquire+0xb9/0x3d0 ? __btrfs_tree_read_lock+0x32/0x170 down_read_nested+0x43/0x130 ? __btrfs_tree_read_lock+0x32/0x170 __btrfs_tree_read_lock+0x32/0x170 __btrfs_read_lock_root_node+0x3a/0x50 btrfs_search_slot+0x614/0x9d0 ? find_held_lock+0x2b/0x80 btrfs_find_root+0x35/0x1b0 ? do_raw_spin_unlock+0x4b/0xa0 btrfs_read_tree_root+0x61/0x120 btrfs_get_root_ref+0x14b/0x600 find_parent_nodes+0x3e6/0x1b30 btrfs_find_all_roots_safe+0xb4/0x130 btrfs_find_all_roots+0x60/0x80 btrfs_qgroup_trace_extent_post+0x27/0x40 btrfs_add_delayed_data_ref+0x3fd/0x460 btrfs_free_extent+0x42/0x100 __btrfs_mod_ref+0x1d7/0x2f0 walk_up_proc+0x11c/0x400 walk_up_tree+0xf0/0x180 btrfs_drop_snapshot+0x1c7/0x780 ? btrfs_clean_one_deleted_snapshot+0x73/0x110 btrfs_clean_one_deleted_snapshot+0xfb/0x110 cleaner_kthread+0xd4/0x140 ? btrfs_alloc_root+0x50/0x50 kthread+0x13a/0x150 ? kthread_create_worker_on_cpu+0x40/0x40 ret_from_fork+0x1f/0x30 BTRFS info (device sdb): disk space caching is enabled BTRFS info (device sdb): has skinny extents This happens because qgroups does a backref lookup when we create a delayed ref. From here it may have to look up a root from an indirect ref, which does a normal lookup on the tree_root, which takes the read lock on the tree_root nodes. To fix this we need to add a variant for looking up roots that searches the commit root of the tree_root. Then when we do the backref search using the commit root we are sure to not take any locks on the tree_root nodes. This gets rid of the lockdep splat when running btrfs/104. Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: David Sterba <dsterba@suse.com> |
|
|
|
5223cc60b4 |
btrfs: drop the path before adding qgroup items when enabling qgroups
When enabling qgroups we walk the tree_root and then add a qgroup item for every root that we have. This creates a lock dependency on the tree_root and qgroup_root, which results in the following lockdep splat (with tree locks using rwsem), eg. in tests btrfs/017 or btrfs/022: ====================================================== WARNING: possible circular locking dependency detected 5.9.0-default+ #1299 Not tainted ------------------------------------------------------ btrfs/24552 is trying to acquire lock: ffff9142dfc5f630 (btrfs-quota-00){++++}-{3:3}, at: __btrfs_tree_read_lock+0x35/0x1c0 [btrfs] but task is already holding lock: ffff9142dfc5d0b0 (btrfs-root-00){++++}-{3:3}, at: __btrfs_tree_read_lock+0x35/0x1c0 [btrfs] which lock already depends on the new lock. the existing dependency chain (in reverse order) is: -> #1 (btrfs-root-00){++++}-{3:3}: __lock_acquire+0x3fb/0x730 lock_acquire.part.0+0x6a/0x130 down_read_nested+0x46/0x130 __btrfs_tree_read_lock+0x35/0x1c0 [btrfs] __btrfs_read_lock_root_node+0x3a/0x50 [btrfs] btrfs_search_slot_get_root+0x11d/0x290 [btrfs] btrfs_search_slot+0xc3/0x9f0 [btrfs] btrfs_insert_item+0x6e/0x140 [btrfs] btrfs_create_tree+0x1cb/0x240 [btrfs] btrfs_quota_enable+0xcd/0x790 [btrfs] btrfs_ioctl_quota_ctl+0xc9/0xe0 [btrfs] __x64_sys_ioctl+0x83/0xa0 do_syscall_64+0x2d/0x70 entry_SYSCALL_64_after_hwframe+0x44/0xa9 -> #0 (btrfs-quota-00){++++}-{3:3}: check_prev_add+0x91/0xc30 validate_chain+0x491/0x750 __lock_acquire+0x3fb/0x730 lock_acquire.part.0+0x6a/0x130 down_read_nested+0x46/0x130 __btrfs_tree_read_lock+0x35/0x1c0 [btrfs] __btrfs_read_lock_root_node+0x3a/0x50 [btrfs] btrfs_search_slot_get_root+0x11d/0x290 [btrfs] btrfs_search_slot+0xc3/0x9f0 [btrfs] btrfs_insert_empty_items+0x58/0xa0 [btrfs] add_qgroup_item.part.0+0x72/0x210 [btrfs] btrfs_quota_enable+0x3bb/0x790 [btrfs] btrfs_ioctl_quota_ctl+0xc9/0xe0 [btrfs] __x64_sys_ioctl+0x83/0xa0 do_syscall_64+0x2d/0x70 entry_SYSCALL_64_after_hwframe+0x44/0xa9 other info that might help us debug this: Possible unsafe locking scenario: CPU0 CPU1 ---- ---- lock(btrfs-root-00); lock(btrfs-quota-00); lock(btrfs-root-00); lock(btrfs-quota-00); *** DEADLOCK *** 5 locks held by btrfs/24552: #0: ffff9142df431478 (sb_writers#10){.+.+}-{0:0}, at: mnt_want_write_file+0x22/0xa0 #1: ffff9142f9b10cc0 (&fs_info->subvol_sem){++++}-{3:3}, at: btrfs_ioctl_quota_ctl+0x7b/0xe0 [btrfs] #2: ffff9142f9b11a08 (&fs_info->qgroup_ioctl_lock){+.+.}-{3:3}, at: btrfs_quota_enable+0x3b/0x790 [btrfs] #3: ffff9142df431698 (sb_internal#2){.+.+}-{0:0}, at: start_transaction+0x406/0x510 [btrfs] #4: ffff9142dfc5d0b0 (btrfs-root-00){++++}-{3:3}, at: __btrfs_tree_read_lock+0x35/0x1c0 [btrfs] stack backtrace: CPU: 1 PID: 24552 Comm: btrfs Not tainted 5.9.0-default+ #1299 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.12.0-59-gc9ba527-rebuilt.opensuse.org 04/01/2014 Call Trace: dump_stack+0x77/0x97 check_noncircular+0xf3/0x110 check_prev_add+0x91/0xc30 validate_chain+0x491/0x750 __lock_acquire+0x3fb/0x730 lock_acquire.part.0+0x6a/0x130 ? __btrfs_tree_read_lock+0x35/0x1c0 [btrfs] ? lock_acquire+0xc4/0x140 ? __btrfs_tree_read_lock+0x35/0x1c0 [btrfs] down_read_nested+0x46/0x130 ? __btrfs_tree_read_lock+0x35/0x1c0 [btrfs] __btrfs_tree_read_lock+0x35/0x1c0 [btrfs] ? btrfs_root_node+0xd9/0x200 [btrfs] __btrfs_read_lock_root_node+0x3a/0x50 [btrfs] btrfs_search_slot_get_root+0x11d/0x290 [btrfs] btrfs_search_slot+0xc3/0x9f0 [btrfs] btrfs_insert_empty_items+0x58/0xa0 [btrfs] add_qgroup_item.part.0+0x72/0x210 [btrfs] btrfs_quota_enable+0x3bb/0x790 [btrfs] btrfs_ioctl_quota_ctl+0xc9/0xe0 [btrfs] __x64_sys_ioctl+0x83/0xa0 do_syscall_64+0x2d/0x70 entry_SYSCALL_64_after_hwframe+0x44/0xa9 Fix this by dropping the path whenever we find a root item, add the qgroup item, and then re-lookup the root item we found and continue processing roots. Reported-by: David Sterba <dsterba@suse.com> Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
|
|
|
66d204a16c |
btrfs: fix readahead hang and use-after-free after removing a device
Very sporadically I had test case btrfs/069 from fstests hanging (for years, it is not a recent regression), with the following traces in dmesg/syslog: [162301.160628] BTRFS info (device sdc): dev_replace from /dev/sdd (devid 2) to /dev/sdg started [162301.181196] BTRFS info (device sdc): scrub: finished on devid 4 with status: 0 [162301.287162] BTRFS info (device sdc): dev_replace from /dev/sdd (devid 2) to /dev/sdg finished [162513.513792] INFO: task btrfs-transacti:1356167 blocked for more than 120 seconds. [162513.514318] Not tainted 5.9.0-rc6-btrfs-next-69 #1 [162513.514522] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. [162513.514747] task:btrfs-transacti state:D stack: 0 pid:1356167 ppid: 2 flags:0x00004000 [162513.514751] Call Trace: [162513.514761] __schedule+0x5ce/0xd00 [162513.514765] ? _raw_spin_unlock_irqrestore+0x3c/0x60 [162513.514771] schedule+0x46/0xf0 [162513.514844] wait_current_trans+0xde/0x140 [btrfs] [162513.514850] ? finish_wait+0x90/0x90 [162513.514864] start_transaction+0x37c/0x5f0 [btrfs] [162513.514879] transaction_kthread+0xa4/0x170 [btrfs] [162513.514891] ? btrfs_cleanup_transaction+0x660/0x660 [btrfs] [162513.514894] kthread+0x153/0x170 [162513.514897] ? kthread_stop+0x2c0/0x2c0 [162513.514902] ret_from_fork+0x22/0x30 [162513.514916] INFO: task fsstress:1356184 blocked for more than 120 seconds. [162513.515192] Not tainted 5.9.0-rc6-btrfs-next-69 #1 [162513.515431] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. [162513.515680] task:fsstress state:D stack: 0 pid:1356184 ppid:1356177 flags:0x00004000 [162513.515682] Call Trace: [162513.515688] __schedule+0x5ce/0xd00 [162513.515691] ? _raw_spin_unlock_irqrestore+0x3c/0x60 [162513.515697] schedule+0x46/0xf0 [162513.515712] wait_current_trans+0xde/0x140 [btrfs] [162513.515716] ? finish_wait+0x90/0x90 [162513.515729] start_transaction+0x37c/0x5f0 [btrfs] [162513.515743] btrfs_attach_transaction_barrier+0x1f/0x50 [btrfs] [162513.515753] btrfs_sync_fs+0x61/0x1c0 [btrfs] [162513.515758] ? __ia32_sys_fdatasync+0x20/0x20 [162513.515761] iterate_supers+0x87/0xf0 [162513.515765] ksys_sync+0x60/0xb0 [162513.515768] __do_sys_sync+0xa/0x10 [162513.515771] do_syscall_64+0x33/0x80 [162513.515774] entry_SYSCALL_64_after_hwframe+0x44/0xa9 [162513.515781] RIP: 0033:0x7f5238f50bd7 [162513.515782] Code: Bad RIP value. [162513.515784] RSP: 002b:00007fff67b978e8 EFLAGS: 00000206 ORIG_RAX: 00000000000000a2 [162513.515786] RAX: ffffffffffffffda RBX: 000055b1fad2c560 RCX: 00007f5238f50bd7 [162513.515788] RDX: 00000000ffffffff RSI: 000000000daf0e74 RDI: 000000000000003a [162513.515789] RBP: 0000000000000032 R08: 000000000000000a R09: 00007f5239019be0 [162513.515791] R10: fffffffffffff24f R11: 0000000000000206 R12: 000000000000003a [162513.515792] R13: 00007fff67b97950 R14: 00007fff67b97906 R15: 000055b1fad1a340 [162513.515804] INFO: task fsstress:1356185 blocked for more than 120 seconds. [162513.516064] Not tainted 5.9.0-rc6-btrfs-next-69 #1 [162513.516329] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. [162513.516617] task:fsstress state:D stack: 0 pid:1356185 ppid:1356177 flags:0x00000000 [162513.516620] Call Trace: [162513.516625] __schedule+0x5ce/0xd00 [162513.516628] ? _raw_spin_unlock_irqrestore+0x3c/0x60 [162513.516634] schedule+0x46/0xf0 [162513.516647] wait_current_trans+0xde/0x140 [btrfs] [162513.516650] ? finish_wait+0x90/0x90 [162513.516662] start_transaction+0x4d7/0x5f0 [btrfs] [162513.516679] btrfs_setxattr_trans+0x3c/0x100 [btrfs] [162513.516686] __vfs_setxattr+0x66/0x80 [162513.516691] __vfs_setxattr_noperm+0x70/0x200 [162513.516697] vfs_setxattr+0x6b/0x120 [162513.516703] setxattr+0x125/0x240 [162513.516709] ? lock_acquire+0xb1/0x480 [162513.516712] ? mnt_want_write+0x20/0x50 [162513.516721] ? rcu_read_lock_any_held+0x8e/0xb0 [162513.516723] ? preempt_count_add+0x49/0xa0 [162513.516725] ? __sb_start_write+0x19b/0x290 [162513.516727] ? preempt_count_add+0x49/0xa0 [162513.516732] path_setxattr+0xba/0xd0 [162513.516739] __x64_sys_setxattr+0x27/0x30 [162513.516741] do_syscall_64+0x33/0x80 [162513.516743] entry_SYSCALL_64_after_hwframe+0x44/0xa9 [162513.516745] RIP: 0033:0x7f5238f56d5a [162513.516746] Code: Bad RIP value. [162513.516748] RSP: 002b:00007fff67b97868 EFLAGS: 00000202 ORIG_RAX: 00000000000000bc [162513.516750] RAX: ffffffffffffffda RBX: 0000000000000001 RCX: 00007f5238f56d5a [162513.516751] RDX: 000055b1fbb0d5a0 RSI: 00007fff67b978a0 RDI: 000055b1fbb0d470 [162513.516753] RBP: 000055b1fbb0d5a0 R08: 0000000000000001 R09: 00007fff67b97700 [162513.516754] R10: 0000000000000004 R11: 0000000000000202 R12: 0000000000000004 [162513.516756] R13: 0000000000000024 R14: 0000000000000001 R15: 00007fff67b978a0 [162513.516767] INFO: task fsstress:1356196 blocked for more than 120 seconds. [162513.517064] Not tainted 5.9.0-rc6-btrfs-next-69 #1 [162513.517365] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. [162513.517763] task:fsstress state:D stack: 0 pid:1356196 ppid:1356177 flags:0x00004000 [162513.517780] Call Trace: [162513.517786] __schedule+0x5ce/0xd00 [162513.517789] ? _raw_spin_unlock_irqrestore+0x3c/0x60 [162513.517796] schedule+0x46/0xf0 [162513.517810] wait_current_trans+0xde/0x140 [btrfs] [162513.517814] ? finish_wait+0x90/0x90 [162513.517829] start_transaction+0x37c/0x5f0 [btrfs] [162513.517845] btrfs_attach_transaction_barrier+0x1f/0x50 [btrfs] [162513.517857] btrfs_sync_fs+0x61/0x1c0 [btrfs] [162513.517862] ? __ia32_sys_fdatasync+0x20/0x20 [162513.517865] iterate_supers+0x87/0xf0 [162513.517869] ksys_sync+0x60/0xb0 [162513.517872] __do_sys_sync+0xa/0x10 [162513.517875] do_syscall_64+0x33/0x80 [162513.517878] entry_SYSCALL_64_after_hwframe+0x44/0xa9 [162513.517881] RIP: 0033:0x7f5238f50bd7 [162513.517883] Code: Bad RIP value. [162513.517885] RSP: 002b:00007fff67b978e8 EFLAGS: 00000206 ORIG_RAX: 00000000000000a2 [162513.517887] RAX: ffffffffffffffda RBX: 000055b1fad2c560 RCX: 00007f5238f50bd7 [162513.517889] RDX: 0000000000000000 RSI: 000000007660add2 RDI: 0000000000000053 [162513.517891] RBP: 0000000000000032 R08: 0000000000000067 R09: 00007f5239019be0 [162513.517893] R10: fffffffffffff24f R11: 0000000000000206 R12: 0000000000000053 [162513.517895] R13: 00007fff67b97950 R14: 00007fff67b97906 R15: 000055b1fad1a340 [162513.517908] INFO: task fsstress:1356197 blocked for more than 120 seconds. [162513.518298] Not tainted 5.9.0-rc6-btrfs-next-69 #1 [162513.518672] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. [162513.519157] task:fsstress state:D stack: 0 pid:1356197 ppid:1356177 flags:0x00000000 [162513.519160] Call Trace: [162513.519165] __schedule+0x5ce/0xd00 [162513.519168] ? _raw_spin_unlock_irqrestore+0x3c/0x60 [162513.519174] schedule+0x46/0xf0 [162513.519190] wait_current_trans+0xde/0x140 [btrfs] [162513.519193] ? finish_wait+0x90/0x90 [162513.519206] start_transaction+0x4d7/0x5f0 [btrfs] [162513.519222] btrfs_create+0x57/0x200 [btrfs] [162513.519230] lookup_open+0x522/0x650 [162513.519246] path_openat+0x2b8/0xa50 [162513.519270] do_filp_open+0x91/0x100 [162513.519275] ? find_held_lock+0x32/0x90 [162513.519280] ? lock_acquired+0x33b/0x470 [162513.519285] ? do_raw_spin_unlock+0x4b/0xc0 [162513.519287] ? _raw_spin_unlock+0x29/0x40 [162513.519295] do_sys_openat2+0x20d/0x2d0 [162513.519300] do_sys_open+0x44/0x80 [162513.519304] do_syscall_64+0x33/0x80 [162513.519307] entry_SYSCALL_64_after_hwframe+0x44/0xa9 [162513.519309] RIP: 0033:0x7f5238f4a903 [162513.519310] Code: Bad RIP value. [162513.519312] RSP: 002b:00007fff67b97758 EFLAGS: 00000246 ORIG_RAX: 0000000000000055 [162513.519314] RAX: ffffffffffffffda RBX: 00000000ffffffff RCX: 00007f5238f4a903 [162513.519316] RDX: 0000000000000000 RSI: 00000000000001b6 RDI: 000055b1fbb0d470 [162513.519317] RBP: 00007fff67b978c0 R08: 0000000000000001 R09: 0000000000000002 [162513.519319] R10: 00007fff67b974f7 R11: 0000000000000246 R12: 0000000000000013 [162513.519320] R13: 00000000000001b6 R14: 00007fff67b97906 R15: 000055b1fad1c620 [162513.519332] INFO: task btrfs:1356211 blocked for more than 120 seconds. [162513.519727] Not tainted 5.9.0-rc6-btrfs-next-69 #1 [162513.520115] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. [162513.520508] task:btrfs state:D stack: 0 pid:1356211 ppid:1356178 flags:0x00004002 [162513.520511] Call Trace: [162513.520516] __schedule+0x5ce/0xd00 [162513.520519] ? _raw_spin_unlock_irqrestore+0x3c/0x60 [162513.520525] schedule+0x46/0xf0 [162513.520544] btrfs_scrub_pause+0x11f/0x180 [btrfs] [162513.520548] ? finish_wait+0x90/0x90 [162513.520562] btrfs_commit_transaction+0x45a/0xc30 [btrfs] [162513.520574] ? start_transaction+0xe0/0x5f0 [btrfs] [162513.520596] btrfs_dev_replace_finishing+0x6d8/0x711 [btrfs] [162513.520619] btrfs_dev_replace_by_ioctl.cold+0x1cc/0x1fd [btrfs] [162513.520639] btrfs_ioctl+0x2a25/0x36f0 [btrfs] [162513.520643] ? do_sigaction+0xf3/0x240 [162513.520645] ? find_held_lock+0x32/0x90 [162513.520648] ? do_sigaction+0xf3/0x240 [162513.520651] ? lock_acquired+0x33b/0x470 [162513.520655] ? _raw_spin_unlock_irq+0x24/0x50 [162513.520657] ? lockdep_hardirqs_on+0x7d/0x100 [162513.520660] ? _raw_spin_unlock_irq+0x35/0x50 [162513.520662] ? do_sigaction+0xf3/0x240 [162513.520671] ? __x64_sys_ioctl+0x83/0xb0 [162513.520672] __x64_sys_ioctl+0x83/0xb0 [162513.520677] do_syscall_64+0x33/0x80 [162513.520679] entry_SYSCALL_64_after_hwframe+0x44/0xa9 [162513.520681] RIP: 0033:0x7fc3cd307d87 [162513.520682] Code: Bad RIP value. [162513.520684] RSP: 002b:00007ffe30a56bb8 EFLAGS: 00000202 ORIG_RAX: 0000000000000010 [162513.520686] RAX: ffffffffffffffda RBX: 0000000000000004 RCX: 00007fc3cd307d87 [162513.520687] RDX: 00007ffe30a57a30 RSI: 00000000ca289435 RDI: 0000000000000003 [162513.520689] RBP: 0000000000000000 R08: 0000000000000000 R09: 0000000000000000 [162513.520690] R10: 0000000000000008 R11: 0000000000000202 R12: 0000000000000003 [162513.520692] R13: 0000557323a212e0 R14: 00007ffe30a5a520 R15: 0000000000000001 [162513.520703] Showing all locks held in the system: [162513.520712] 1 lock held by khungtaskd/54: [162513.520713] #0: ffffffffb40a91a0 (rcu_read_lock){....}-{1:2}, at: debug_show_all_locks+0x15/0x197 [162513.520728] 1 lock held by in:imklog/596: [162513.520729] #0: ffff8f3f0d781400 (&f->f_pos_lock){+.+.}-{3:3}, at: __fdget_pos+0x4d/0x60 [162513.520782] 1 lock held by btrfs-transacti/1356167: [162513.520784] #0: ffff8f3d810cc848 (&fs_info->transaction_kthread_mutex){+.+.}-{3:3}, at: transaction_kthread+0x4a/0x170 [btrfs] [162513.520798] 1 lock held by btrfs/1356190: [162513.520800] #0: ffff8f3d57644470 (sb_writers#15){.+.+}-{0:0}, at: mnt_want_write_file+0x22/0x60 [162513.520805] 1 lock held by fsstress/1356184: [162513.520806] #0: ffff8f3d576440e8 (&type->s_umount_key#62){++++}-{3:3}, at: iterate_supers+0x6f/0xf0 [162513.520811] 3 locks held by fsstress/1356185: [162513.520812] #0: ffff8f3d57644470 (sb_writers#15){.+.+}-{0:0}, at: mnt_want_write+0x20/0x50 [162513.520815] #1: ffff8f3d80a650b8 (&type->i_mutex_dir_key#10){++++}-{3:3}, at: vfs_setxattr+0x50/0x120 [162513.520820] #2: ffff8f3d57644690 (sb_internal#2){.+.+}-{0:0}, at: start_transaction+0x40e/0x5f0 [btrfs] [162513.520833] 1 lock held by fsstress/1356196: [162513.520834] #0: ffff8f3d576440e8 (&type->s_umount_key#62){++++}-{3:3}, at: iterate_supers+0x6f/0xf0 [162513.520838] 3 locks held by fsstress/1356197: [162513.520839] #0: ffff8f3d57644470 (sb_writers#15){.+.+}-{0:0}, at: mnt_want_write+0x20/0x50 [162513.520843] #1: ffff8f3d506465e8 (&type->i_mutex_dir_key#10){++++}-{3:3}, at: path_openat+0x2a7/0xa50 [162513.520846] #2: ffff8f3d57644690 (sb_internal#2){.+.+}-{0:0}, at: start_transaction+0x40e/0x5f0 [btrfs] [162513.520858] 2 locks held by btrfs/1356211: [162513.520859] #0: ffff8f3d810cde30 (&fs_info->dev_replace.lock_finishing_cancel_unmount){+.+.}-{3:3}, at: btrfs_dev_replace_finishing+0x52/0x711 [btrfs] [162513.520877] #1: ffff8f3d57644690 (sb_internal#2){.+.+}-{0:0}, at: start_transaction+0x40e/0x5f0 [btrfs] This was weird because the stack traces show that a transaction commit, triggered by a device replace operation, is blocking trying to pause any running scrubs but there are no stack traces of blocked tasks doing a scrub. After poking around with drgn, I noticed there was a scrub task that was constantly running and blocking for shorts periods of time: >>> t = find_task(prog, 1356190) >>> prog.stack_trace(t) #0 __schedule+0x5ce/0xcfc #1 schedule+0x46/0xe4 #2 schedule_timeout+0x1df/0x475 #3 btrfs_reada_wait+0xda/0x132 #4 scrub_stripe+0x2a8/0x112f #5 scrub_chunk+0xcd/0x134 #6 scrub_enumerate_chunks+0x29e/0x5ee #7 btrfs_scrub_dev+0x2d5/0x91b #8 btrfs_ioctl+0x7f5/0x36e7 #9 __x64_sys_ioctl+0x83/0xb0 #10 do_syscall_64+0x33/0x77 #11 entry_SYSCALL_64+0x7c/0x156 Which corresponds to: int btrfs_reada_wait(void *handle) { struct reada_control *rc = handle; struct btrfs_fs_info *fs_info = rc->fs_info; while (atomic_read(&rc->elems)) { if (!atomic_read(&fs_info->reada_works_cnt)) reada_start_machine(fs_info); wait_event_timeout(rc->wait, atomic_read(&rc->elems) == 0, (HZ + 9) / 10); } (...) So the counter "rc->elems" was set to 1 and never decreased to 0, causing the scrub task to loop forever in that function. Then I used the following script for drgn to check the readahead requests: $ cat dump_reada.py import sys import drgn from drgn import NULL, Object, cast, container_of, execscript, \ reinterpret, sizeof from drgn.helpers.linux import * mnt_path = b"/home/fdmanana/btrfs-tests/scratch_1" mnt = None for mnt in for_each_mount(prog, dst = mnt_path): pass if mnt is None: sys.stderr.write(f'Error: mount point {mnt_path} not found\n') sys.exit(1) fs_info = cast('struct btrfs_fs_info *', mnt.mnt.mnt_sb.s_fs_info) def dump_re(re): nzones = re.nzones.value_() print(f're at {hex(re.value_())}') print(f'\t logical {re.logical.value_()}') print(f'\t refcnt {re.refcnt.value_()}') print(f'\t nzones {nzones}') for i in range(nzones): dev = re.zones[i].device name = dev.name.str.string_() print(f'\t\t dev id {dev.devid.value_()} name {name}') print() for _, e in radix_tree_for_each(fs_info.reada_tree): re = cast('struct reada_extent *', e) dump_re(re) $ drgn dump_reada.py re at 0xffff8f3da9d25ad8 logical 38928384 refcnt 1 nzones 1 dev id 0 name b'/dev/sdd' $ So there was one readahead extent with a single zone corresponding to the source device of that last device replace operation logged in dmesg/syslog. Also the ID of that zone's device was 0 which is a special value set in the source device of a device replace operation when the operation finishes (constant BTRFS_DEV_REPLACE_DEVID set at btrfs_dev_replace_finishing()), confirming again that device /dev/sdd was the source of a device replace operation. Normally there should be as many zones in the readahead extent as there are devices, and I wasn't expecting the extent to be in a block group with a 'single' profile, so I went and confirmed with the following drgn script that there weren't any single profile block groups: $ cat dump_block_groups.py import sys import drgn from drgn import NULL, Object, cast, container_of, execscript, \ reinterpret, sizeof from drgn.helpers.linux import * mnt_path = b"/home/fdmanana/btrfs-tests/scratch_1" mnt = None for mnt in for_each_mount(prog, dst = mnt_path): pass if mnt is None: sys.stderr.write(f'Error: mount point {mnt_path} not found\n') sys.exit(1) fs_info = cast('struct btrfs_fs_info *', mnt.mnt.mnt_sb.s_fs_info) BTRFS_BLOCK_GROUP_DATA = (1 << 0) BTRFS_BLOCK_GROUP_SYSTEM = (1 << 1) BTRFS_BLOCK_GROUP_METADATA = (1 << 2) BTRFS_BLOCK_GROUP_RAID0 = (1 << 3) BTRFS_BLOCK_GROUP_RAID1 = (1 << 4) BTRFS_BLOCK_GROUP_DUP = (1 << 5) BTRFS_BLOCK_GROUP_RAID10 = (1 << 6) BTRFS_BLOCK_GROUP_RAID5 = (1 << 7) BTRFS_BLOCK_GROUP_RAID6 = (1 << 8) BTRFS_BLOCK_GROUP_RAID1C3 = (1 << 9) BTRFS_BLOCK_GROUP_RAID1C4 = (1 << 10) def bg_flags_string(bg): flags = bg.flags.value_() ret = '' if flags & BTRFS_BLOCK_GROUP_DATA: ret = 'data' if flags & BTRFS_BLOCK_GROUP_METADATA: if len(ret) > 0: ret += '|' ret += 'meta' if flags & BTRFS_BLOCK_GROUP_SYSTEM: if len(ret) > 0: ret += '|' ret += 'system' if flags & BTRFS_BLOCK_GROUP_RAID0: ret += ' raid0' elif flags & BTRFS_BLOCK_GROUP_RAID1: ret += ' raid1' elif flags & BTRFS_BLOCK_GROUP_DUP: ret += ' dup' elif flags & BTRFS_BLOCK_GROUP_RAID10: ret += ' raid10' elif flags & BTRFS_BLOCK_GROUP_RAID5: ret += ' raid5' elif flags & BTRFS_BLOCK_GROUP_RAID6: ret += ' raid6' elif flags & BTRFS_BLOCK_GROUP_RAID1C3: ret += ' raid1c3' elif flags & BTRFS_BLOCK_GROUP_RAID1C4: ret += ' raid1c4' else: ret += ' single' return ret def dump_bg(bg): print() print(f'block group at {hex(bg.value_())}') print(f'\t start {bg.start.value_()} length {bg.length.value_()}') print(f'\t flags {bg.flags.value_()} - {bg_flags_string(bg)}') bg_root = fs_info.block_group_cache_tree.address_of_() for bg in rbtree_inorder_for_each_entry('struct btrfs_block_group', bg_root, 'cache_node'): dump_bg(bg) $ drgn dump_block_groups.py block group at 0xffff8f3d673b0400 start 22020096 length 16777216 flags 258 - system raid6 block group at 0xffff8f3d53ddb400 start 38797312 length 536870912 flags 260 - meta raid6 block group at 0xffff8f3d5f4d9c00 start 575668224 length 2147483648 flags 257 - data raid6 block group at 0xffff8f3d08189000 start 2723151872 length 67108864 flags 258 - system raid6 block group at 0xffff8f3db70ff000 start 2790260736 length 1073741824 flags 260 - meta raid6 block group at 0xffff8f3d5f4dd800 start 3864002560 length 67108864 flags 258 - system raid6 block group at 0xffff8f3d67037000 start 3931111424 length 2147483648 flags 257 - data raid6 $ So there were only 2 reasons left for having a readahead extent with a single zone: reada_find_zone(), called when creating a readahead extent, returned NULL either because we failed to find the corresponding block group or because a memory allocation failed. With some additional and custom tracing I figured out that on every further ocurrence of the problem the block group had just been deleted when we were looping to create the zones for the readahead extent (at reada_find_extent()), so we ended up with only one zone in the readahead extent, corresponding to a device that ends up getting replaced. So after figuring that out it became obvious why the hang happens: 1) Task A starts a scrub on any device of the filesystem, except for device /dev/sdd; 2) Task B starts a device replace with /dev/sdd as the source device; 3) Task A calls btrfs_reada_add() from scrub_stripe() and it is currently starting to scrub a stripe from block group X. This call to btrfs_reada_add() is the one for the extent tree. When btrfs_reada_add() calls reada_add_block(), it passes the logical address of the extent tree's root node as its 'logical' argument - a value of 38928384; 4) Task A then enters reada_find_extent(), called from reada_add_block(). It finds there isn't any existing readahead extent for the logical address 38928384, so it proceeds to the path of creating a new one. It calls btrfs_map_block() to find out which stripes exist for the block group X. On the first iteration of the for loop that iterates over the stripes, it finds the stripe for device /dev/sdd, so it creates one zone for that device and adds it to the readahead extent. Before getting into the second iteration of the loop, the cleanup kthread deletes block group X because it was empty. So in the iterations for the remaining stripes it does not add more zones to the readahead extent, because the calls to reada_find_zone() returned NULL because they couldn't find block group X anymore. As a result the new readahead extent has a single zone, corresponding to the device /dev/sdd; 4) Before task A returns to btrfs_reada_add() and queues the readahead job for the readahead work queue, task B finishes the device replace and at btrfs_dev_replace_finishing() swaps the device /dev/sdd with the new device /dev/sdg; 5) Task A returns to reada_add_block(), which increments the counter "->elems" of the reada_control structure allocated at btrfs_reada_add(). Then it returns back to btrfs_reada_add() and calls reada_start_machine(). This queues a job in the readahead work queue to run the function reada_start_machine_worker(), which calls __reada_start_machine(). At __reada_start_machine() we take the device list mutex and for each device found in the current device list, we call reada_start_machine_dev() to start the readahead work. However at this point the device /dev/sdd was already freed and is not in the device list anymore. This means the corresponding readahead for the extent at 38928384 is never started, and therefore the "->elems" counter of the reada_control structure allocated at btrfs_reada_add() never goes down to 0, causing the call to btrfs_reada_wait(), done by the scrub task, to wait forever. Note that the readahead request can be made either after the device replace started or before it started, however in pratice it is very unlikely that a device replace is able to start after a readahead request is made and is able to complete before the readahead request completes - maybe only on a very small and nearly empty filesystem. This hang however is not the only problem we can have with readahead and device removals. When the readahead extent has other zones other than the one corresponding to the device that is being removed (either by a device replace or a device remove operation), we risk having a use-after-free on the device when dropping the last reference of the readahead extent. For example if we create a readahead extent with two zones, one for the device /dev/sdd and one for the device /dev/sde: 1) Before the readahead worker starts, the device /dev/sdd is removed, and the corresponding btrfs_device structure is freed. However the readahead extent still has the zone pointing to the device structure; 2) When the readahead worker starts, it only finds device /dev/sde in the current device list of the filesystem; 3) It starts the readahead work, at reada_start_machine_dev(), using the device /dev/sde; 4) Then when it finishes reading the extent from device /dev/sde, it calls __readahead_hook() which ends up dropping the last reference on the readahead extent through the last call to reada_extent_put(); 5) At reada_extent_put() it iterates over each zone of the readahead extent and attempts to delete an element from the device's 'reada_extents' radix tree, resulting in a use-after-free, as the device pointer of the zone for /dev/sdd is now stale. We can also access the device after dropping the last reference of a zone, through reada_zone_release(), also called by reada_extent_put(). And a device remove suffers the same problem, however since it shrinks the device size down to zero before removing the device, it is very unlikely to still have readahead requests not completed by the time we free the device, the only possibility is if the device has a very little space allocated. While the hang problem is exclusive to scrub, since it is currently the only user of btrfs_reada_add() and btrfs_reada_wait(), the use-after-free problem affects any path that triggers readhead, which includes btree_readahead_hook() and __readahead_hook() (a readahead worker can trigger readahed for the children of a node) for example - any path that ends up calling reada_add_block() can trigger the use-after-free after a device is removed. So fix this by waiting for any readahead requests for a device to complete before removing a device, ensuring that while waiting for existing ones no new ones can be made. This problem has been around for a very long time - the readahead code was added in 2011, device remove exists since 2008 and device replace was introduced in 2013, hard to pick a specific commit for a git Fixes tag. CC: stable@vger.kernel.org # 4.4+ Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
|
|
|
83bc1560e0 |
btrfs: fix use-after-free on readahead extent after failure to create it
If we fail to find suitable zones for a new readahead extent, we end up
leaving a stale pointer in the global readahead extents radix tree
(fs_info->reada_tree), which can trigger the following trace later on:
[13367.696354] BUG: kernel NULL pointer dereference, address: 00000000000000b0
[13367.696802] #PF: supervisor read access in kernel mode
[13367.697249] #PF: error_code(0x0000) - not-present page
[13367.697721] PGD 0 P4D 0
[13367.698171] Oops: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC PTI
[13367.698632] CPU: 6 PID: 851214 Comm: btrfs Tainted: G W 5.9.0-rc6-btrfs-next-69 #1
[13367.699100] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.13.0-0-gf21b5a4aeb02-prebuilt.qemu.org 04/01/2014
[13367.700069] RIP: 0010:__lock_acquire+0x20a/0x3970
[13367.700562] Code: ff 1f 0f b7 c0 48 0f (...)
[13367.701609] RSP: 0018:ffffb14448f57790 EFLAGS: 00010046
[13367.702140] RAX: 0000000000000000 RBX: 29b935140c15e8cf RCX: 0000000000000000
[13367.702698] RDX: 0000000000000002 RSI: ffffffffb3d66bd0 RDI: 0000000000000046
[13367.703240] RBP: ffff8a52ba8ac040 R08: 00000c2866ad9288 R09: 0000000000000001
[13367.703783] R10: 0000000000000001 R11: 00000000b66d9b53 R12: ffff8a52ba8ac9b0
[13367.704330] R13: 0000000000000000 R14: ffff8a532b6333e8 R15: 0000000000000000
[13367.704880] FS: 00007fe1df6b5700(0000) GS:ffff8a5376600000(0000) knlGS:0000000000000000
[13367.705438] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[13367.705995] CR2: 00000000000000b0 CR3: 000000022cca8004 CR4: 00000000003706e0
[13367.706565] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[13367.707127] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[13367.707686] Call Trace:
[13367.708246] ? ___slab_alloc+0x395/0x740
[13367.708820] ? reada_add_block+0xae/0xee0 [btrfs]
[13367.709383] lock_acquire+0xb1/0x480
[13367.709955] ? reada_add_block+0xe0/0xee0 [btrfs]
[13367.710537] ? reada_add_block+0xae/0xee0 [btrfs]
[13367.711097] ? rcu_read_lock_sched_held+0x5d/0x90
[13367.711659] ? kmem_cache_alloc_trace+0x8d2/0x990
[13367.712221] ? lock_acquired+0x33b/0x470
[13367.712784] _raw_spin_lock+0x34/0x80
[13367.713356] ? reada_add_block+0xe0/0xee0 [btrfs]
[13367.713966] reada_add_block+0xe0/0xee0 [btrfs]
[13367.714529] ? btrfs_root_node+0x15/0x1f0 [btrfs]
[13367.715077] btrfs_reada_add+0x117/0x170 [btrfs]
[13367.715620] scrub_stripe+0x21e/0x10d0 [btrfs]
[13367.716141] ? kvm_sched_clock_read+0x5/0x10
[13367.716657] ? __lock_acquire+0x41e/0x3970
[13367.717184] ? scrub_chunk+0x60/0x140 [btrfs]
[13367.717697] ? find_held_lock+0x32/0x90
[13367.718254] ? scrub_chunk+0x60/0x140 [btrfs]
[13367.718773] ? lock_acquired+0x33b/0x470
[13367.719278] ? scrub_chunk+0xcd/0x140 [btrfs]
[13367.719786] scrub_chunk+0xcd/0x140 [btrfs]
[13367.720291] scrub_enumerate_chunks+0x270/0x5c0 [btrfs]
[13367.720787] ? finish_wait+0x90/0x90
[13367.721281] btrfs_scrub_dev+0x1ee/0x620 [btrfs]
[13367.721762] ? rcu_read_lock_any_held+0x8e/0xb0
[13367.722235] ? preempt_count_add+0x49/0xa0
[13367.722710] ? __sb_start_write+0x19b/0x290
[13367.723192] btrfs_ioctl+0x7f5/0x36f0 [btrfs]
[13367.723660] ? __fget_files+0x101/0x1d0
[13367.724118] ? find_held_lock+0x32/0x90
[13367.724559] ? __fget_files+0x101/0x1d0
[13367.724982] ? __x64_sys_ioctl+0x83/0xb0
[13367.725399] __x64_sys_ioctl+0x83/0xb0
[13367.725802] do_syscall_64+0x33/0x80
[13367.726188] entry_SYSCALL_64_after_hwframe+0x44/0xa9
[13367.726574] RIP: 0033:0x7fe1df7add87
[13367.726948] Code: 00 00 00 48 8b 05 09 91 (...)
[13367.727763] RSP: 002b:00007fe1df6b4d48 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
[13367.728179] RAX: ffffffffffffffda RBX: 000055ce1fb596a0 RCX: 00007fe1df7add87
[13367.728604] RDX: 000055ce1fb596a0 RSI: 00000000c400941b RDI: 0000000000000003
[13367.729021] RBP: 0000000000000000 R08: 00007fe1df6b5700 R09: 0000000000000000
[13367.729431] R10: 00007fe1df6b5700 R11: 0000000000000246 R12: 00007ffd922b07de
[13367.729842] R13: 00007ffd922b07df R14: 00007fe1df6b4e40 R15: 0000000000802000
[13367.730275] Modules linked in: btrfs blake2b_generic xor (...)
[13367.732638] CR2: 00000000000000b0
[13367.733166] ---[ end trace d298b6805556acd9 ]---
What happens is the following:
1) At reada_find_extent() we don't find any existing readahead extent for
the metadata extent starting at logical address X;
2) So we proceed to create a new one. We then call btrfs_map_block() to get
information about which stripes contain extent X;
3) After that we iterate over the stripes and create only one zone for the
readahead extent - only one because reada_find_zone() returned NULL for
all iterations except for one, either because a memory allocation failed
or it couldn't find the block group of the extent (it may have just been
deleted);
4) We then add the new readahead extent to the readahead extents radix
tree at fs_info->reada_tree;
5) Then we iterate over each zone of the new readahead extent, and find
that the device used for that zone no longer exists, because it was
removed or it was the source device of a device replace operation.
Since this left 'have_zone' set to 0, after finishing the loop we jump
to the 'error' label, call kfree() on the new readahead extent and
return without removing it from the radix tree at fs_info->reada_tree;
6) Any future call to reada_find_extent() for the logical address X will
find the stale pointer in the readahead extents radix tree, increment
its reference counter, which can trigger the use-after-free right
away or return it to the caller reada_add_block() that results in the
use-after-free of the example trace above.
So fix this by making sure we delete the readahead extent from the radix
tree if we fail to setup zones for it (when 'have_zone = 0').
Fixes:
|
|
|
|
85d07fbe09 |
btrfs: tree-checker: validate number of chunk stripes and parity
If there's no parity and num_stripes < ncopies, a crafted image can trigger a division by zero in calc_stripe_length(). The image was generated through fuzzing. CC: stable@vger.kernel.org # 5.4+ Reviewed-by: Qu Wenruo <wqu@suse.com> Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=209587 Signed-off-by: Daniel Xu <dxu@dxuuu.xyz> Signed-off-by: David Sterba <dsterba@suse.com> |
|
|
|
cad69d1396 |
btrfs: tree-checker: fix incorrect printk format
This patch addresses a compile warning:
fs/btrfs/extent-tree.c: In function '__btrfs_free_extent':
fs/btrfs/extent-tree.c:3187:4: warning: format '%lu' expects argument of type 'long unsigned int', but argument 8 has type 'unsigned int' [-Wformat=]
Fixes:
|
|
|
|
7837fa8870 |
btrfs: drop the path before adding block group sysfs files
Dave reported a problem with my rwsem conversion patch where we got the following lockdep splat: ====================================================== WARNING: possible circular locking dependency detected 5.9.0-default+ #1297 Not tainted ------------------------------------------------------ kswapd0/76 is trying to acquire lock: ffff9d5d25df2530 (&delayed_node->mutex){+.+.}-{3:3}, at: __btrfs_release_delayed_node.part.0+0x3f/0x320 [btrfs] but task is already holding lock: ffffffffa40cbba0 (fs_reclaim){+.+.}-{0:0}, at: __fs_reclaim_acquire+0x5/0x30 which lock already depends on the new lock. the existing dependency chain (in reverse order) is: -> #4 (fs_reclaim){+.+.}-{0:0}: __lock_acquire+0x582/0xac0 lock_acquire+0xca/0x430 fs_reclaim_acquire.part.0+0x25/0x30 kmem_cache_alloc+0x30/0x9c0 alloc_inode+0x81/0x90 iget_locked+0xcd/0x1a0 kernfs_get_inode+0x1b/0x130 kernfs_get_tree+0x136/0x210 sysfs_get_tree+0x1a/0x50 vfs_get_tree+0x1d/0xb0 path_mount+0x70f/0xa80 do_mount+0x75/0x90 __x64_sys_mount+0x8e/0xd0 do_syscall_64+0x2d/0x70 entry_SYSCALL_64_after_hwframe+0x44/0xa9 -> #3 (kernfs_mutex){+.+.}-{3:3}: __lock_acquire+0x582/0xac0 lock_acquire+0xca/0x430 __mutex_lock+0xa0/0xaf0 kernfs_add_one+0x23/0x150 kernfs_create_dir_ns+0x58/0x80 sysfs_create_dir_ns+0x70/0xd0 kobject_add_internal+0xbb/0x2d0 kobject_add+0x7a/0xd0 btrfs_sysfs_add_block_group_type+0x141/0x1d0 [btrfs] btrfs_read_block_groups+0x1f1/0x8c0 [btrfs] open_ctree+0x981/0x1108 [btrfs] btrfs_mount_root.cold+0xe/0xb0 [btrfs] legacy_get_tree+0x2d/0x60 vfs_get_tree+0x1d/0xb0 fc_mount+0xe/0x40 vfs_kern_mount.part.0+0x71/0x90 btrfs_mount+0x13b/0x3e0 [btrfs] legacy_get_tree+0x2d/0x60 vfs_get_tree+0x1d/0xb0 path_mount+0x70f/0xa80 do_mount+0x75/0x90 __x64_sys_mount+0x8e/0xd0 do_syscall_64+0x2d/0x70 entry_SYSCALL_64_after_hwframe+0x44/0xa9 -> #2 (btrfs-extent-00){++++}-{3:3}: __lock_acquire+0x582/0xac0 lock_acquire+0xca/0x430 down_read_nested+0x45/0x220 __btrfs_tree_read_lock+0x35/0x1c0 [btrfs] __btrfs_read_lock_root_node+0x3a/0x50 [btrfs] btrfs_search_slot+0x6d4/0xfd0 [btrfs] check_committed_ref+0x69/0x200 [btrfs] btrfs_cross_ref_exist+0x65/0xb0 [btrfs] run_delalloc_nocow+0x446/0x9b0 [btrfs] btrfs_run_delalloc_range+0x61/0x6a0 [btrfs] writepage_delalloc+0xae/0x160 [btrfs] __extent_writepage+0x262/0x420 [btrfs] extent_write_cache_pages+0x2b6/0x510 [btrfs] extent_writepages+0x43/0x90 [btrfs] do_writepages+0x40/0xe0 __writeback_single_inode+0x62/0x610 writeback_sb_inodes+0x20f/0x500 wb_writeback+0xef/0x4a0 wb_do_writeback+0x49/0x2e0 wb_workfn+0x81/0x340 process_one_work+0x233/0x5d0 worker_thread+0x50/0x3b0 kthread+0x137/0x150 ret_from_fork+0x1f/0x30 -> #1 (btrfs-fs-00){++++}-{3:3}: __lock_acquire+0x582/0xac0 lock_acquire+0xca/0x430 down_read_nested+0x45/0x220 __btrfs_tree_read_lock+0x35/0x1c0 [btrfs] __btrfs_read_lock_root_node+0x3a/0x50 [btrfs] btrfs_search_slot+0x6d4/0xfd0 [btrfs] btrfs_lookup_inode+0x3a/0xc0 [btrfs] __btrfs_update_delayed_inode+0x93/0x2c0 [btrfs] __btrfs_commit_inode_delayed_items+0x7de/0x850 [btrfs] __btrfs_run_delayed_items+0x8e/0x140 [btrfs] btrfs_commit_transaction+0x367/0xbc0 [btrfs] btrfs_mksubvol+0x2db/0x470 [btrfs] btrfs_mksnapshot+0x7b/0xb0 [btrfs] __btrfs_ioctl_snap_create+0x16f/0x1a0 [btrfs] btrfs_ioctl_snap_create_v2+0xb0/0xf0 [btrfs] btrfs_ioctl+0xd0b/0x2690 [btrfs] __x64_sys_ioctl+0x6f/0xa0 do_syscall_64+0x2d/0x70 entry_SYSCALL_64_after_hwframe+0x44/0xa9 -> #0 (&delayed_node->mutex){+.+.}-{3:3}: check_prev_add+0x91/0xc60 validate_chain+0xa6e/0x2a20 __lock_acquire+0x582/0xac0 lock_acquire+0xca/0x430 __mutex_lock+0xa0/0xaf0 __btrfs_release_delayed_node.part.0+0x3f/0x320 [btrfs] btrfs_evict_inode+0x3cc/0x560 [btrfs] evict+0xd6/0x1c0 dispose_list+0x48/0x70 prune_icache_sb+0x54/0x80 super_cache_scan+0x121/0x1a0 do_shrink_slab+0x16d/0x3b0 shrink_slab+0xb1/0x2e0 shrink_node+0x230/0x6a0 balance_pgdat+0x325/0x750 kswapd+0x206/0x4d0 kthread+0x137/0x150 ret_from_fork+0x1f/0x30 other info that might help us debug this: Chain exists of: &delayed_node->mutex --> kernfs_mutex --> fs_reclaim Possible unsafe locking scenario: CPU0 CPU1 ---- ---- lock(fs_reclaim); lock(kernfs_mutex); lock(fs_reclaim); lock(&delayed_node->mutex); *** DEADLOCK *** 3 locks held by kswapd0/76: #0: ffffffffa40cbba0 (fs_reclaim){+.+.}-{0:0}, at: __fs_reclaim_acquire+0x5/0x30 #1: ffffffffa40b8b58 (shrinker_rwsem){++++}-{3:3}, at: shrink_slab+0x54/0x2e0 #2: ffff9d5d322390e8 (&type->s_umount_key#26){++++}-{3:3}, at: trylock_super+0x16/0x50 stack backtrace: CPU: 2 PID: 76 Comm: kswapd0 Not tainted 5.9.0-default+ #1297 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.12.0-59-gc9ba527-rebuilt.opensuse.org 04/01/2014 Call Trace: dump_stack+0x77/0x97 check_noncircular+0xff/0x110 ? save_trace+0x50/0x470 check_prev_add+0x91/0xc60 validate_chain+0xa6e/0x2a20 ? save_trace+0x50/0x470 __lock_acquire+0x582/0xac0 lock_acquire+0xca/0x430 ? __btrfs_release_delayed_node.part.0+0x3f/0x320 [btrfs] __mutex_lock+0xa0/0xaf0 ? __btrfs_release_delayed_node.part.0+0x3f/0x320 [btrfs] ? __lock_acquire+0x582/0xac0 ? __btrfs_release_delayed_node.part.0+0x3f/0x320 [btrfs] ? btrfs_evict_inode+0x30b/0x560 [btrfs] ? __btrfs_release_delayed_node.part.0+0x3f/0x320 [btrfs] __btrfs_release_delayed_node.part.0+0x3f/0x320 [btrfs] btrfs_evict_inode+0x3cc/0x560 [btrfs] evict+0xd6/0x1c0 dispose_list+0x48/0x70 prune_icache_sb+0x54/0x80 super_cache_scan+0x121/0x1a0 do_shrink_slab+0x16d/0x3b0 shrink_slab+0xb1/0x2e0 shrink_node+0x230/0x6a0 balance_pgdat+0x325/0x750 kswapd+0x206/0x4d0 ? finish_wait+0x90/0x90 ? balance_pgdat+0x750/0x750 kthread+0x137/0x150 ? kthread_mod_delayed_work+0xc0/0xc0 ret_from_fork+0x1f/0x30 This happens because we are still holding the path open when we start adding the sysfs files for the block groups, which creates a dependency on fs_reclaim via the tree lock. Fix this by dropping the path before we start doing anything with sysfs. Reported-by: David Sterba <dsterba@suse.com> CC: stable@vger.kernel.org # 5.8+ Reviewed-by: Anand Jain <anand.jain@oracle.com> Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
|
|
|
1afc708dca |
btrfs: fix relocation failure due to race with fallocate
When doing a fallocate() we have a short time window, after reserving an extent and before starting a transaction, where if relocation for the block group containing the reserved extent happens, we can end up missing the extent in the data relocation inode causing relocation to fail later. This only happens when we don't pass a transaction to the internal fallocate function __btrfs_prealloc_file_range(), which is for all the cases where fallocate() is called from user space (the internal use cases include space cache extent allocation and relocation). When the race triggers the relocation failure, it produces a trace like the following: [200611.995995] ------------[ cut here ]------------ [200611.997084] BTRFS: Transaction aborted (error -2) [200611.998208] WARNING: CPU: 3 PID: 235845 at fs/btrfs/ctree.c:1074 __btrfs_cow_block+0x3a0/0x5b0 [btrfs] [200611.999042] Modules linked in: dm_thin_pool dm_persistent_data (...) [200612.003287] CPU: 3 PID: 235845 Comm: btrfs Not tainted 5.9.0-rc6-btrfs-next-69 #1 [200612.004442] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.13.0-0-gf21b5a4aeb02-prebuilt.qemu.org 04/01/2014 [200612.006186] RIP: 0010:__btrfs_cow_block+0x3a0/0x5b0 [btrfs] [200612.007110] Code: 1b 00 00 02 72 2a 83 f8 fb 0f 84 b8 01 (...) [200612.007341] BTRFS warning (device sdb): Skipping commit of aborted transaction. [200612.008959] RSP: 0018:ffffaee38550f918 EFLAGS: 00010286 [200612.009672] BTRFS: error (device sdb) in cleanup_transaction:1901: errno=-30 Readonly filesystem [200612.010428] RAX: 0000000000000000 RBX: ffff9174d96f4000 RCX: 0000000000000000 [200612.011078] BTRFS info (device sdb): forced readonly [200612.011862] RDX: 0000000000000001 RSI: ffffffffa8161978 RDI: 00000000ffffffff [200612.013215] RBP: ffff9172569a0f80 R08: 0000000000000000 R09: 0000000000000000 [200612.014263] R10: 0000000000000000 R11: 0000000000000000 R12: ffff9174b8403b88 [200612.015203] R13: ffff9174b8400a88 R14: ffff9174c90f1000 R15: ffff9174a5a60e08 [200612.016182] FS: 00007fa55cf878c0(0000) GS:ffff9174ece00000(0000) knlGS:0000000000000000 [200612.017174] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [200612.018418] CR2: 00007f8fb8048148 CR3: 0000000428a46003 CR4: 00000000003706e0 [200612.019510] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 [200612.020648] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 [200612.021520] Call Trace: [200612.022434] btrfs_cow_block+0x10b/0x250 [btrfs] [200612.023407] do_relocation+0x54e/0x7b0 [btrfs] [200612.024343] ? do_raw_spin_unlock+0x4b/0xc0 [200612.025280] ? _raw_spin_unlock+0x29/0x40 [200612.026200] relocate_tree_blocks+0x3bc/0x6d0 [btrfs] [200612.027088] relocate_block_group+0x2f3/0x600 [btrfs] [200612.027961] btrfs_relocate_block_group+0x15e/0x340 [btrfs] [200612.028896] btrfs_relocate_chunk+0x38/0x110 [btrfs] [200612.029772] btrfs_balance+0xb22/0x1790 [btrfs] [200612.030601] ? btrfs_ioctl_balance+0x253/0x380 [btrfs] [200612.031414] btrfs_ioctl_balance+0x2cf/0x380 [btrfs] [200612.032279] btrfs_ioctl+0x620/0x36f0 [btrfs] [200612.033077] ? _raw_spin_unlock+0x29/0x40 [200612.033948] ? handle_mm_fault+0x116d/0x1ca0 [200612.034749] ? up_read+0x18/0x240 [200612.035542] ? __x64_sys_ioctl+0x83/0xb0 [200612.036244] __x64_sys_ioctl+0x83/0xb0 [200612.037269] do_syscall_64+0x33/0x80 [200612.038190] entry_SYSCALL_64_after_hwframe+0x44/0xa9 [200612.038976] RIP: 0033:0x7fa55d07ed87 [200612.040127] Code: 00 00 00 48 8b 05 09 91 0c 00 64 c7 00 26 (...) [200612.041669] RSP: 002b:00007ffd5ebf03e8 EFLAGS: 00000206 ORIG_RAX: 0000000000000010 [200612.042437] RAX: ffffffffffffffda RBX: 0000000000000001 RCX: 00007fa55d07ed87 [200612.043511] RDX: 00007ffd5ebf0470 RSI: 00000000c4009420 RDI: 0000000000000003 [200612.044250] RBP: 0000000000000003 R08: 000055d8362642a0 R09: 00007fa55d148be0 [200612.044963] R10: fffffffffffff52e R11: 0000000000000206 R12: 00007ffd5ebf1614 [200612.045683] R13: 00007ffd5ebf0470 R14: 0000000000000002 R15: 00007ffd5ebf0470 [200612.046361] irq event stamp: 0 [200612.047040] hardirqs last enabled at (0): [<0000000000000000>] 0x0 [200612.047725] hardirqs last disabled at (0): [<ffffffffa6eb5ab3>] copy_process+0x823/0x1bc0 [200612.048387] softirqs last enabled at (0): [<ffffffffa6eb5ab3>] copy_process+0x823/0x1bc0 [200612.049024] softirqs last disabled at (0): [<0000000000000000>] 0x0 [200612.049722] ---[ end trace 49006c6876e65227 ]--- The race happens like this: 1) Task A starts an fallocate() (plain or zero range) and it calls __btrfs_prealloc_file_range() with the 'trans' parameter set to NULL; 2) Task A calls btrfs_reserve_extent() and gets an extent that belongs to block group X; 3) Before task A gets into btrfs_replace_file_extents(), through the call to insert_prealloc_file_extent(), task B starts relocation of block group X; 4) Task B enters btrfs_relocate_block_group() and it sets block group X to RO mode; 5) Task B enters relocate_block_group(), it calls prepare_to_relocate() whichs joins/starts a transaction and then commits the transaction; 6) Task B then starts scanning the extent tree looking for extents that belong to block group X - it does not find yet the extent reserved by task A, since that extent was not yet added to the extent tree, as its delayed reference was not even yet created at this point; 7) The data relocation inode ends up not having the extent reserved by task A associated to it; 8) Task A then starts a transaction through btrfs_replace_file_extents(), inserts a file extent item in the subvolume tree pointing to the reserved extent and creates a delayed reference for it; 9) Task A finishes and returns success to user space; 10) Later on, while relocation is still in progress, the leaf where task A inserted the new file extent item is COWed, so we end up at __btrfs_cow_block(), which calls btrfs_reloc_cow_block(), and that in turn calls relocation.c:replace_file_extents(); 11) At relocation.c:replace_file_extents() we iterate over all the items in the leaf and find the file extent item pointing to the extent that was allocated by task A, and then call relocation.c:get_new_location(), to find the new location for the extent; 12) However relocation.c:get_new_location() fails, returning -ENOENT, because it couldn't find a corresponding file extent item associated with the data relocation inode. This is because the extent was not seen in the extent tree at step 6). The -ENOENT error is propagated to __btrfs_cow_block(), which aborts the transaction. So fix this simply by decrementing the block group's number of reservations after calling insert_prealloc_file_extent(), as relocation waits for that counter to go down to zero before calling prepare_to_relocate() and start looking for extents in the extent tree. This issue only started to happen recently as of commit |
|
|
|
3ad11d7ac8 |
block-5.10-2020-10-12
-----BEGIN PGP SIGNATURE----- iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAl+EWUgQHGF4Ym9lQGtl cm5lbC5kawAKCRD301j7KXHgpnoxEADCVSNBRkpV0OVkOEC3wf8EGhXhk01Jnjtl u5Mg2V55hcgJ0thQxBV/V28XyqmsEBrmAVi0Yf8Vr9Qbq4Ze08Wae4ChS4rEOyh1 jTcGYWx5aJB3ChLvV/HI0nWQ3bkj03mMrL3SW8rhhf5DTyKHsVeTenpx42Qu/FKf fRzi09FSr3Pjd0B+EX6gunwJnlyXQC5Fa4AA0GhnXJzAznANXxHkkcXu8a6Yw75x e28CfhIBliORsK8sRHLoUnPpeTe1vtxCBhBMsE+gJAj9ZUOWMzvNFIPP4FvfawDy 6cCQo2m1azJ/IdZZCDjFUWyjh+wxdKMp+NNryEcoV+VlqIoc3n98rFwrSL+GIq5Z WVwEwq+AcwoMCsD29Lu1ytL2PQ/RVqcJP5UheMrbL4vzefNfJFumQVZLIcX0k943 8dFL2QHL+H/hM9Dx5y5rjeiWkAlq75v4xPKVjh/DHb4nehddCqn/+DD5HDhNANHf c1kmmEuYhvLpIaC4DHjE6DwLh8TPKahJjwsGuBOTr7D93NUQD+OOWsIhX6mNISIl FFhP8cd0/ZZVV//9j+q+5B4BaJsT+ZtwmrelKFnPdwPSnh+3iu8zPRRWO+8P8fRC YvddxuJAmE6BLmsAYrdz6Xb/wqfyV44cEiyivF0oBQfnhbtnXwDnkDWSfJD1bvCm ZwfpDh2+Tg== =LzyE -----END PGP SIGNATURE----- Merge tag 'block-5.10-2020-10-12' of git://git.kernel.dk/linux-block Pull block updates from Jens Axboe: - Series of merge handling cleanups (Baolin, Christoph) - Series of blk-throttle fixes and cleanups (Baolin) - Series cleaning up BDI, seperating the block device from the backing_dev_info (Christoph) - Removal of bdget() as a generic API (Christoph) - Removal of blkdev_get() as a generic API (Christoph) - Cleanup of is-partition checks (Christoph) - Series reworking disk revalidation (Christoph) - Series cleaning up bio flags (Christoph) - bio crypt fixes (Eric) - IO stats inflight tweak (Gabriel) - blk-mq tags fixes (Hannes) - Buffer invalidation fixes (Jan) - Allow soft limits for zone append (Johannes) - Shared tag set improvements (John, Kashyap) - Allow IOPRIO_CLASS_RT for CAP_SYS_NICE (Khazhismel) - DM no-wait support (Mike, Konstantin) - Request allocation improvements (Ming) - Allow md/dm/bcache to use IO stat helpers (Song) - Series improving blk-iocost (Tejun) - Various cleanups (Geert, Damien, Danny, Julia, Tetsuo, Tian, Wang, Xianting, Yang, Yufen, yangerkun) * tag 'block-5.10-2020-10-12' of git://git.kernel.dk/linux-block: (191 commits) block: fix uapi blkzoned.h comments blk-mq: move cancel of hctx->run_work to the front of blk_exit_queue blk-mq: get rid of the dead flush handle code path block: get rid of unnecessary local variable block: fix comment and add lockdep assert blk-mq: use helper function to test hw stopped block: use helper function to test queue register block: remove redundant mq check block: invoke blk_mq_exit_sched no matter whether have .exit_sched percpu_ref: don't refer to ref->data if it isn't allocated block: ratelimit handle_bad_sector() message blk-throttle: Re-use the throtl_set_slice_end() blk-throttle: Open code __throtl_de/enqueue_tg() blk-throttle: Move service tree validation out of the throtl_rb_first() blk-throttle: Move the list operation after list validation blk-throttle: Fix IO hang for a corner case blk-throttle: Avoid tracking latency if low limit is invalid blk-throttle: Avoid getting the current time if tg->last_finish_time is 0 blk-throttle: Remove a meaningless parameter for throtl_downgrade_state() block: Remove redundant 'return' statement ... |
|
|
|
1fd4033dd0 |
btrfs: rename BTRFS_INODE_ORDERED_DATA_CLOSE flag
Commit |
|
|
|
8d1a7aae89 |
btrfs: annotate device name rcu_string with __rcu
This patch fixes the following sparse errors in fs/btrfs/super.c in function btrfs_show_devname() fs/btrfs/super.c: error: incompatible types in comparison expression (different address spaces): fs/btrfs/super.c: struct rcu_string [noderef] <asn:4> * fs/btrfs/super.c: struct rcu_string * The error was because of the following line in function btrfs_show_devname(): if (first_dev) seq_escape(m, rcu_str_deref(first_dev->name), " \t\n\\"); Annotating the btrfs_device::name member with __rcu fixes the sparse error. Acked-by: Joel Fernandes (Google) <joel@joelfernandes.org> Signed-off-by: Madhuparna Bhowmik <madhuparnabhowmik04@gmail.com> Signed-off-by: David Sterba <dsterba@suse.com> |
|
|
|
96c2e067ed |
btrfs: skip devices without magic signature when mounting
Many things can happen after the device is scanned and before the device is mounted. One such thing is losing the BTRFS_MAGIC on the device. If it happens we still won't free that device from the memory and cause the userland confusion. For example: As the BTRFS_IOC_DEV_INFO still carries the device path which does not have the BTRFS_MAGIC, 'btrfs fi show' still lists device which does not belong to the filesystem anymore: $ mkfs.btrfs -fq -draid1 -mraid1 /dev/sda /dev/sdb $ wipefs -a /dev/sdb # /dev/sdb does not contain magic signature $ mount -o degraded /dev/sda /btrfs $ btrfs fi show -m Label: none uuid: 470ec6fb-646b-4464-b3cb-df1b26c527bd Total devices 2 FS bytes used 128.00KiB devid 1 size 3.00GiB used 571.19MiB path /dev/sda devid 2 size 3.00GiB used 571.19MiB path /dev/sdb We need to distinguish the missing signature and invalid superblock, so add a specific error code ENODATA for that. This also fixes failure of fstest btrfs/198. CC: stable@vger.kernel.org # 4.19+ Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Anand Jain <anand.jain@oracle.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
|
|
|
572c83acdc |
btrfs: cleanup cow block on error
In fstest btrfs/064 a transaction abort in __btrfs_cow_block could lead
to a system lockup. It gets stuck trying to write back inodes, and the
write back thread was trying to lock an extent buffer:
$ cat /proc/2143497/stack
[<0>] __btrfs_tree_lock+0x108/0x250
[<0>] lock_extent_buffer_for_io+0x35e/0x3a0
[<0>] btree_write_cache_pages+0x15a/0x3b0
[<0>] do_writepages+0x28/0xb0
[<0>] __writeback_single_inode+0x54/0x5c0
[<0>] writeback_sb_inodes+0x1e8/0x510
[<0>] wb_writeback+0xcc/0x440
[<0>] wb_workfn+0xd7/0x650
[<0>] process_one_work+0x236/0x560
[<0>] worker_thread+0x55/0x3c0
[<0>] kthread+0x13a/0x150
[<0>] ret_from_fork+0x1f/0x30
This is because we got an error while COWing a block, specifically here
if (test_bit(BTRFS_ROOT_SHAREABLE, &root->state)) {
ret = btrfs_reloc_cow_block(trans, root, buf, cow);
if (ret) {
btrfs_abort_transaction(trans, ret);
return ret;
}
}
[16402.241552] BTRFS: Transaction aborted (error -2)
[16402.242362] WARNING: CPU: 1 PID: 2563188 at fs/btrfs/ctree.c:1074 __btrfs_cow_block+0x376/0x540
[16402.249469] CPU: 1 PID: 2563188 Comm: fsstress Not tainted 5.9.0-rc6+ #8
[16402.249936] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.13.0-2.fc32 04/01/2014
[16402.250525] RIP: 0010:__btrfs_cow_block+0x376/0x540
[16402.252417] RSP: 0018:ffff9cca40e578b0 EFLAGS: 00010282
[16402.252787] RAX: 0000000000000025 RBX: 0000000000000002 RCX: ffff9132bbd19388
[16402.253278] RDX: 00000000ffffffd8 RSI: 0000000000000027 RDI: ffff9132bbd19380
[16402.254063] RBP: ffff9132b41a49c0 R08: 0000000000000000 R09: 0000000000000000
[16402.254887] R10: 0000000000000000 R11: ffff91324758b080 R12: ffff91326ef17ce0
[16402.255694] R13: ffff91325fc0f000 R14: ffff91326ef176b0 R15: ffff9132815e2000
[16402.256321] FS: 00007f542c6d7b80(0000) GS:ffff9132bbd00000(0000) knlGS:0000000000000000
[16402.256973] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[16402.257374] CR2: 00007f127b83f250 CR3: 0000000133480002 CR4: 0000000000370ee0
[16402.257867] Call Trace:
[16402.258072] btrfs_cow_block+0x109/0x230
[16402.258356] btrfs_search_slot+0x530/0x9d0
[16402.258655] btrfs_lookup_file_extent+0x37/0x40
[16402.259155] __btrfs_drop_extents+0x13c/0xd60
[16402.259628] ? btrfs_block_rsv_migrate+0x4f/0xb0
[16402.259949] btrfs_replace_file_extents+0x190/0x820
[16402.260873] btrfs_clone+0x9ae/0xc00
[16402.261139] btrfs_extent_same_range+0x66/0x90
[16402.261771] btrfs_remap_file_range+0x353/0x3b1
[16402.262333] vfs_dedupe_file_range_one.part.0+0xd5/0x140
[16402.262821] vfs_dedupe_file_range+0x189/0x220
[16402.263150] do_vfs_ioctl+0x552/0x700
[16402.263662] __x64_sys_ioctl+0x62/0xb0
[16402.264023] do_syscall_64+0x33/0x40
[16402.264364] entry_SYSCALL_64_after_hwframe+0x44/0xa9
[16402.264862] RIP: 0033:0x7f542c7d15cb
[16402.266901] RSP: 002b:00007ffd35944ea8 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
[16402.267627] RAX: ffffffffffffffda RBX: 00000000009d1968 RCX: 00007f542c7d15cb
[16402.268298] RDX: 00000000009d2490 RSI: 00000000c0189436 RDI: 0000000000000003
[16402.268958] RBP: 00000000009d2520 R08: 0000000000000036 R09: 00000000009d2e64
[16402.269726] R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000002
[16402.270659] R13: 000000000001f000 R14: 00000000009d1970 R15: 00000000009d2e80
[16402.271498] irq event stamp: 0
[16402.271846] hardirqs last enabled at (0): [<0000000000000000>] 0x0
[16402.272497] hardirqs last disabled at (0): [<ffffffff910dbf59>] copy_process+0x6b9/0x1ba0
[16402.273343] softirqs last enabled at (0): [<ffffffff910dbf59>] copy_process+0x6b9/0x1ba0
[16402.273905] softirqs last disabled at (0): [<0000000000000000>] 0x0
[16402.274338] ---[ end trace 737874a5a41a8236 ]---
[16402.274669] BTRFS: error (device dm-9) in __btrfs_cow_block:1074: errno=-2 No such entry
[16402.276179] BTRFS info (device dm-9): forced readonly
[16402.277046] BTRFS: error (device dm-9) in btrfs_replace_file_extents:2723: errno=-2 No such entry
[16402.278744] BTRFS: error (device dm-9) in __btrfs_cow_block:1074: errno=-2 No such entry
[16402.279968] BTRFS: error (device dm-9) in __btrfs_cow_block:1074: errno=-2 No such entry
[16402.280582] BTRFS info (device dm-9): balance: ended with status: -30
The problem here is that as soon as we allocate the new block it is
locked and marked dirty in the btree inode. This means that we could
attempt to writeback this block and need to lock the extent buffer.
However we're not unlocking it here and thus we deadlock.
Fix this by unlocking the cow block if we have any errors inside of
__btrfs_cow_block, and also free it so we do not leak it.
CC: stable@vger.kernel.org # 4.4+
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
|
|
e3c57805f8 |
btrfs: remove BTRFS_INODE_READDIO_NEED_LOCK
Since we now perform direct reads using i_rwsem, we can remove this inode flag used to co-ordinate unlocked reads. The truncate call takes i_rwsem. This means it is correctly synchronized with concurrent direct reads. Reviewed-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: Johannes Thumshirn <jth@kernel.org> Reviewed-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
|
|
|
92e26df43b |
btrfs: return error if we're unable to read device stats
I noticed when fixing device stats for seed devices that we simply threw away the return value from btrfs_search_slot(). This is because we may not have stat items, but we could very well get an error, and thus miss reporting the error up the chain. Fix this by returning ret if it's an actual error, and then stop trying to init the rest of the devices stats and return the error up the chain. Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: David Sterba <dsterba@suse.com> |
|
|
|
124604eb50 |
btrfs: init device stats for seed devices
We recently started recording device stats across the fleet, and noticed a large increase in messages such as this BTRFS warning (device dm-0): get dev_stats failed, not yet valid on our tiers that use seed devices for their root devices. This is because we do not initialize the device stats for any seed devices if we have a sprout device and mount using that sprout device. The basic steps for reproducing are: $ mkfs seed device $ mount seed device # fill seed device $ umount seed device $ btrfstune -S 1 seed device $ mount seed device $ btrfs device add -f sprout device /mnt/wherever $ umount /mnt/wherever $ mount sprout device /mnt/wherever $ btrfs device stats /mnt/wherever This will fail with the above message in dmesg. Fix this by iterating over the fs_devices->seed if they exist in btrfs_init_dev_stats. This fixed the problem and properly reports the stats for both devices. Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> [ rename to btrfs_device_init_dev_stats ] Signed-off-by: David Sterba <dsterba@suse.com> |
|
|
|
905eb88bce |
btrfs: remove struct extent_io_ops
It's no longer used just remove the function and any related code which was initialising it for inodes. No functional changes. Removing 8 bytes from extent_io_tree in turn reduces size of other structures where it is embedded, notably btrfs_inode where it reduces size by 24 bytes. Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
|
|
|
1b36294a6c |
btrfs: call submit_bio_hook directly for metadata pages
No need to go through a function pointer indirection simply call submit_bio_hook directly by exporting and renaming the helper to btrfs_submit_metadata_bio. This makes the code more readable and should result in somewhat faster code due to no longer paying the price for specualtive attack mitigations that come with indirect function calls. Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
|
|
|
908930f3ed |
btrfs: stop calling submit_bio_hook for data inodes
Instead export and rename the function to btrfs_submit_data_bio and call it directly in submit_one_bio. This avoids paying the cost for speculative attacks mitigations and improves code readability. Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
|
|
|
be17b3afc4 |
btrfs: don't opencode is_data_inode in end_bio_extent_readpage
Use the is_data_inode helper. Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
|
|
|
cd0537449c |
btrfs: call submit_bio_hook directly in submit_one_bio
BTRFS has 2 inode types (for the purposes of the code in submit_one_bio) - ordinary data inodes (including the freespace inode) and the btree inode. Both of these implement submit_bio_hook so btrfsic_submit_bio can never be called from submit_one_bio so just remove it. Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
|
|
|
1f03d9cfda |
btrfs: remove extent_io_ops::readpage_end_io_hook
It's no longer used so let's remove it. Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
|
|
|
9a446d6a9f |
btrfs: replace readpage_end_io_hook with direct calls
Don't call readpage_end_io_hook for the btree inode. Instead of relying on indirect calls to implement metadata buffer validation simply check if the inode whose page we are processing equals the btree inode. If it does call the necessary function. This is an improvement in 2 directions: 1. We aren't paying the penalty of indirect calls in a post-speculation attacks world. 2. The function is now named more explicitly so it's obvious what's going on This is in preparation to removing struct extent_io_ops altogether. Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
|
|
|
9c2b4e0347 |
btrfs: send, recompute reference path after orphanization of a directory
During an incremental send, when an inode has multiple new references we might end up emitting rename operations for orphanizations that have a source path that is no longer valid due to a previous orphanization of some directory inode. This causes the receiver to fail since it tries to rename a path that does not exists. Example reproducer: $ cat reproducer.sh #!/bin/bash mkfs.btrfs -f /dev/sdi >/dev/null mount /dev/sdi /mnt/sdi touch /mnt/sdi/f1 touch /mnt/sdi/f2 mkdir /mnt/sdi/d1 mkdir /mnt/sdi/d1/d2 # Filesystem looks like: # # . (ino 256) # |----- f1 (ino 257) # |----- f2 (ino 258) # |----- d1/ (ino 259) # |----- d2/ (ino 260) btrfs subvolume snapshot -r /mnt/sdi /mnt/sdi/snap1 btrfs send -f /tmp/snap1.send /mnt/sdi/snap1 # Now do a series of changes such that: # # *) inode 258 has one new hardlink and the previous name changed # # *) both names conflict with the old names of two other inodes: # # 1) the new name "d1" conflicts with the old name of inode 259, # under directory inode 256 (root) # # 2) the new name "d2" conflicts with the old name of inode 260 # under directory inode 259 # # *) inodes 259 and 260 now have the old names of inode 258 # # *) inode 257 is now located under inode 260 - an inode with a number # smaller than the inode (258) for which we created a second hard # link and swapped its names with inodes 259 and 260 # ln /mnt/sdi/f2 /mnt/sdi/d1/f2_link mv /mnt/sdi/f1 /mnt/sdi/d1/d2/f1 # Swap d1 and f2. mv /mnt/sdi/d1 /mnt/sdi/tmp mv /mnt/sdi/f2 /mnt/sdi/d1 mv /mnt/sdi/tmp /mnt/sdi/f2 # Swap d2 and f2_link mv /mnt/sdi/f2/d2 /mnt/sdi/tmp mv /mnt/sdi/f2/f2_link /mnt/sdi/f2/d2 mv /mnt/sdi/tmp /mnt/sdi/f2/f2_link # Filesystem now looks like: # # . (ino 256) # |----- d1 (ino 258) # |----- f2/ (ino 259) # |----- f2_link/ (ino 260) # | |----- f1 (ino 257) # | # |----- d2 (ino 258) btrfs subvolume snapshot -r /mnt/sdi /mnt/sdi/snap2 btrfs send -f /tmp/snap2.send -p /mnt/sdi/snap1 /mnt/sdi/snap2 mkfs.btrfs -f /dev/sdj >/dev/null mount /dev/sdj /mnt/sdj btrfs receive -f /tmp/snap1.send /mnt/sdj btrfs receive -f /tmp/snap2.send /mnt/sdj umount /mnt/sdi umount /mnt/sdj When executed the receive of the incremental stream fails: $ ./reproducer.sh Create a readonly snapshot of '/mnt/sdi' in '/mnt/sdi/snap1' At subvol /mnt/sdi/snap1 Create a readonly snapshot of '/mnt/sdi' in '/mnt/sdi/snap2' At subvol /mnt/sdi/snap2 At subvol snap1 At snapshot snap2 ERROR: rename d1/d2 -> o260-6-0 failed: No such file or directory This happens because: 1) When processing inode 257 we end up computing the name for inode 259 because it is an ancestor in the send snapshot, and at that point it still has its old name, "d1", from the parent snapshot because inode 259 was not yet processed. We then cache that name, which is valid until we start processing inode 259 (or set the progress to 260 after processing its references); 2) Later we start processing inode 258 and collecting all its new references into the list sctx->new_refs. The first reference in the list happens to be the reference for name "d1" while the reference for name "d2" is next (the last element of the list). We compute the full path "d1/d2" for this second reference and store it in the reference (its ->full_path member). The path used for the new parent directory was "d1" and not "f2" because inode 259, the new parent, was not yet processed; 3) When we start processing the new references at process_recorded_refs() we start with the first reference in the list, for the new name "d1". Because there is a conflicting inode that was not yet processed, which is directory inode 259, we orphanize it, renaming it from "d1" to "o259-6-0"; 4) Then we start processing the new reference for name "d2", and we realize it conflicts with the reference of inode 260 in the parent snapshot. So we issue an orphanization operation for inode 260 by emitting a rename operation with a destination path of "o260-6-0" and a source path of "d1/d2" - this source path is the value we stored in the reference earlier at step 2), corresponding to the ->full_path member of the reference, however that path is no longer valid due to the orphanization of the directory inode 259 in step 3). This makes the receiver fail since the path does not exists, it should have been "o259-6-0/d2". Fix this by recomputing the full path of a reference before emitting an orphanization if we previously orphanized any directory, since that directory could be a parent in the new path. This is a rare scenario so keeping it simple and not checking if that previously orphanized directory is in fact an ancestor of the inode we are trying to orphanize. A test case for fstests follows soon. CC: stable@vger.kernel.org # 4.4+ Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
|
|
|
98272bb77b |
btrfs: send, orphanize first all conflicting inodes when processing references
When doing an incremental send it is possible that when processing the new references for an inode we end up issuing rename or link operations that have an invalid path, which contains the orphanized name of a directory before we actually orphanized it, causing the receiver to fail. The following reproducer triggers such scenario: $ cat reproducer.sh #!/bin/bash mkfs.btrfs -f /dev/sdi >/dev/null mount /dev/sdi /mnt/sdi touch /mnt/sdi/a touch /mnt/sdi/b mkdir /mnt/sdi/testdir # We want "a" to have a lower inode number then "testdir" (257 vs 259). mv /mnt/sdi/a /mnt/sdi/testdir/a # Filesystem looks like: # # . (ino 256) # |----- testdir/ (ino 259) # | |----- a (ino 257) # | # |----- b (ino 258) btrfs subvolume snapshot -r /mnt/sdi /mnt/sdi/snap1 btrfs send -f /tmp/snap1.send /mnt/sdi/snap1 # Now rename 259 to "testdir_2", then change the name of 257 to # "testdir" and make it a direct descendant of the root inode (256). # Also create a new link for inode 257 with the old name of inode 258. # By swapping the names and location of several inodes and create a # nasty dependency chain of rename and link operations. mv /mnt/sdi/testdir/a /mnt/sdi/a2 touch /mnt/sdi/testdir/a mv /mnt/sdi/b /mnt/sdi/b2 ln /mnt/sdi/a2 /mnt/sdi/b mv /mnt/sdi/testdir /mnt/sdi/testdir_2 mv /mnt/sdi/a2 /mnt/sdi/testdir # Filesystem now looks like: # # . (ino 256) # |----- testdir_2/ (ino 259) # | |----- a (ino 260) # | # |----- testdir (ino 257) # |----- b (ino 257) # |----- b2 (ino 258) btrfs subvolume snapshot -r /mnt/sdi /mnt/sdi/snap2 btrfs send -f /tmp/snap2.send -p /mnt/sdi/snap1 /mnt/sdi/snap2 mkfs.btrfs -f /dev/sdj >/dev/null mount /dev/sdj /mnt/sdj btrfs receive -f /tmp/snap1.send /mnt/sdj btrfs receive -f /tmp/snap2.send /mnt/sdj umount /mnt/sdi umount /mnt/sdj When running the reproducer, the receive of the incremental send stream fails: $ ./reproducer.sh Create a readonly snapshot of '/mnt/sdi' in '/mnt/sdi/snap1' At subvol /mnt/sdi/snap1 Create a readonly snapshot of '/mnt/sdi' in '/mnt/sdi/snap2' At subvol /mnt/sdi/snap2 At subvol snap1 At snapshot snap2 ERROR: link b -> o259-6-0/a failed: No such file or directory The problem happens because of the following: 1) Before we start iterating the list of new references for inode 257, we generate its current path and store it at @valid_path, done at the very beginning of process_recorded_refs(). The generated path is "o259-6-0/a", containing the orphanized name for inode 259; 2) Then we iterate over the list of new references, which has the references "b" and "testdir" in that specific order; 3) We process reference "b" first, because it is in the list before reference "testdir". We then issue a link operation to create the new reference "b" using a target path corresponding to the content at @valid_path, which corresponds to "o259-6-0/a". However we haven't yet orphanized inode 259, its name is still "testdir", and not "o259-6-0". The orphanization of 259 did not happen yet because we will process the reference named "testdir" for inode 257 only in the next iteration of the loop that goes over the list of new references. Fix the issue by having a preliminar iteration over all the new references at process_recorded_refs(). This iteration is responsible only for doing the orphanization of other inodes that have and old reference that conflicts with one of the new references of the inode we are currently processing. The emission of rename and link operations happen now in the next iteration of the new references. A test case for fstests will follow soon. CC: stable@vger.kernel.org # 4.4+ Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
|
|
|
1465af12e2 |
btrfs: tree-checker: fix false alert caused by legacy btrfs root item
Commit |
|
|
|
e97659cefe |
btrfs: use unaligned helpers for stack and header set/get helpers
In the definitions generated by BTRFS_SETGET_HEADER_FUNCS there's direct pointer assignment but we should use the helpers for unaligned access for clarity. It hasn't been a problem so far because of the natural alignment. Similarly for BTRFS_SETGET_STACK_FUNCS, that usually get a structure from stack that has an aligned start but some members may not be aligned due to packing. This as well hasn't caused problems so far. Move the put/get_unaligned_le8 stubs to ctree.h so we can use them. Signed-off-by: David Sterba <dsterba@suse.com> |
|
|
|
6994ca367c |
btrfs: free-space-cache: use unaligned helpers to access data
The free space inode stores the tracking data, checksums etc, using the io_ctl structure and moving the pointers. The data are generally aligned to at least 4 bytes (u32 for CRC) so it's not completely unaligned but for clarity we should use the proper helpers whenever a struct is initialized from io_ctl->cur pointer. Signed-off-by: David Sterba <dsterba@suse.com> |
|
|
|
e2f896b318 |
btrfs: send: use helpers for unaligned access to header members
The header is mapped onto the send buffer and thus its members may be potentially unaligned so use the helpers instead of directly assigning the pointers. This has worked so far but let's use the helpers to make that clear. Signed-off-by: David Sterba <dsterba@suse.com> |
|
|
|
2c53a14dd3 |
btrfs: use own btree inode io_tree owner id
Btree inode is special compared to all other inode extent io_trees, although it has a btrfs inode, it doesn't have the track_uptodate bit at all. This means a lot of things like extent locking doesn't even need to be applied to btree io tree. Since it's so special, adds a new owner value for it to make debuging a little easier. Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
|
|
|
6b613cc97f |
btrfs: reschedule when cloning lots of extents
We have several occurrences of a soft lockup from fstest's generic/175 testcase, which look more or less like this one: watchdog: BUG: soft lockup - CPU#0 stuck for 22s! [xfs_io:10030] Kernel panic - not syncing: softlockup: hung tasks CPU: 0 PID: 10030 Comm: xfs_io Tainted: G L 5.9.0-rc5+ #768 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.13.0-0-gf21b5a4-rebuilt.opensuse.org 04/01/2014 Call Trace: <IRQ> dump_stack+0x77/0xa0 panic+0xfa/0x2cb watchdog_timer_fn.cold+0x85/0xa5 ? lockup_detector_update_enable+0x50/0x50 __hrtimer_run_queues+0x99/0x4c0 ? recalibrate_cpu_khz+0x10/0x10 hrtimer_run_queues+0x9f/0xb0 update_process_times+0x28/0x80 tick_handle_periodic+0x1b/0x60 __sysvec_apic_timer_interrupt+0x76/0x210 asm_call_on_stack+0x12/0x20 </IRQ> sysvec_apic_timer_interrupt+0x7f/0x90 asm_sysvec_apic_timer_interrupt+0x12/0x20 RIP: 0010:btrfs_tree_unlock+0x91/0x1a0 [btrfs] RSP: 0018:ffffc90007123a58 EFLAGS: 00000282 RAX: ffff8881cea2fbe0 RBX: ffff8881cea2fbe0 RCX: 0000000000000000 RDX: ffff8881d23fd200 RSI: ffffffff82045220 RDI: ffff8881cea2fba0 RBP: 0000000000000001 R08: 0000000000000000 R09: 0000000000000032 R10: 0000160000000000 R11: 0000000000001000 R12: 0000000000001000 R13: ffff8882357fd5b0 R14: ffff88816fa76e70 R15: ffff8881cea2fad0 ? btrfs_tree_unlock+0x15b/0x1a0 [btrfs] btrfs_release_path+0x67/0x80 [btrfs] btrfs_insert_replace_extent+0x177/0x2c0 [btrfs] btrfs_replace_file_extents+0x472/0x7c0 [btrfs] btrfs_clone+0x9ba/0xbd0 [btrfs] btrfs_clone_files.isra.0+0xeb/0x140 [btrfs] ? file_update_time+0xcd/0x120 btrfs_remap_file_range+0x322/0x3b0 [btrfs] do_clone_file_range+0xb7/0x1e0 vfs_clone_file_range+0x30/0xa0 ioctl_file_clone+0x8a/0xc0 do_vfs_ioctl+0x5b2/0x6f0 __x64_sys_ioctl+0x37/0xa0 do_syscall_64+0x33/0x40 entry_SYSCALL_64_after_hwframe+0x44/0xa9 RIP: 0033:0x7f87977fc247 RSP: 002b:00007ffd51a2f6d8 EFLAGS: 00000206 ORIG_RAX: 0000000000000010 RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007f87977fc247 RDX: 00007ffd51a2f710 RSI: 000000004020940d RDI: 0000000000000003 RBP: 0000000000000004 R08: 00007ffd51a79080 R09: 0000000000000000 R10: 00005621f11352f2 R11: 0000000000000206 R12: 0000000000000000 R13: 0000000000000000 R14: 00005621f128b958 R15: 0000000080000000 Kernel Offset: disabled ---[ end Kernel panic - not syncing: softlockup: hung tasks ]--- All of these lockup reports have the call chain btrfs_clone_files() -> btrfs_clone() in common. btrfs_clone_files() calls btrfs_clone() with both source and destination extents locked and loops over the source extent to create the clones. Conditionally reschedule in the btrfs_clone() loop, to give some time back to other processes. CC: stable@vger.kernel.org # 4.4+ Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
|
|
|
bae12df966 |
btrfs: use kvcalloc for allocation in btrfs_ioctl_send()
Replace kvzalloc() call with kvcalloc() that also checks the size internally. There's a standalone overflow check in the function so we can return invalid parameter combination. Use array_size() helper to compute the memory size for clone_sources_tmp. Cc: Kees Cook <keescook@chromium.org> Signed-off-by: Denis Efremov <efremov@linux.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
|
|
|
8eb2fd0015 |
btrfs: use kvzalloc() to allocate clone_roots in btrfs_ioctl_send()
btrfs_ioctl_send() used open-coded kvzalloc implementation earlier.
The code was accidentally replaced with kzalloc() call [1]. Restore
the original code by using kvzalloc() to allocate sctx->clone_roots.
[1] https://patchwork.kernel.org/patch/9757891/#20529627
Fixes:
|
|
|
|
c0a4360305 |
btrfs: remove inode argument from btrfs_start_ordered_extent
The passed in ordered_extent struct is always well-formed and contains the inode making the explicit argument redundant. Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
|
|
|
510f85edf1 |
btrfs: remove inode argument from add_pending_csums
It's used to reference the csum root which can be done from the trans handle as well. Simplify the signature and while at it also remove the noinline attribute as the function uses only at most 16 bytes of stack space. Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
|
|
|
3c38c877fc |
btrfs: sink inode argument in insert_ordered_extent_file_extent
Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
|
|
|
71fe0a55da |
btrfs: switch btrfs_remove_ordered_extent to btrfs_inode
Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
|
|
|
633cc816f7 |
btrfs: clean BTRFS_I usage in btrfs_destroy_inode
Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
|
|
|
0f20881249 |
btrfs: open code extent_read_full_page to its sole caller
This makes reading the code a tad easier by decreasing the level of indirection by one. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
|
|
|
fd513000eb |
btrfs: sink mirror_num argument in __do_readpage
It's always set to 0 by the 2 callers so move it inside __do_readpage. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
|
|
|
6f15af6060 |
btrfs: sink read_flags argument into extent_read_full_page
It's always set to 0 by its sole caller - btrfs_readpage. Simply remove it. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
|
|
|
003c286aef |
btrfs: sink mirror_num argument in extent_read_full_page
It's always set to 0 from the sole caller - btrfs_readpage. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
|
|
|
c1be9c1ad5 |
btrfs: promote extent_read_full_page to btrfs_readpage
Now that btrfs_readpage is the only caller of extent_read_full_page the latter can be open coded in the former. Use the occassion to rename __extent_read_full_page to extent_read_full_page. To facillitate this change submit_one_bio has to be exported as well. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
|
|
|
72cffee463 |
btrfs: remove mirror_num argument from extent_read_full_page
It's called only from btrfs_readpage which always passes 0 so just sink the argument into extent_read_full_page. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
|
|
|
1a5ee1e626 |
btrfs: remove btrfs_get_extent indirection from __do_readpage
Now that this function is only responsible for reading data pages it's no longer necessary to pass get_extent_t parameter across several layers of functions. This patch removes this parameter from multiple functions: __get_extent_map/__do_readpage/__extent_read_full_page/ extent_read_full_page and simply calls btrfs_get_extent directly in __get_extent_map. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
|
|
|
208d6341e8 |
btrfs: remove btree_get_extent
The sole purpose of this function was to satisfy the requirements of __do_readpage. Since that function is no longer used to read metadata pages the need to keep btree_get_extent around has also disappeared. Simply remove it. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
|
|
|
0420177c08 |
btrfs: simplify metadata pages reading
Metadata pages currently use __do_readpage to read metadata pages, unfortunately this function is also used to deal with ordinary data pages. This makes the metadata pages reading code to go through multiple hoops in order to adhere to __do_readpage invariants. Most of these are necessary for data pages which could be compressed. For metadata it's enough to simply build a bio and submit it. To this effect simply call submit_extent_page directly from read_extent_buffer_pages which is the only callpath used to populate extent_buffers with data. This in turn enables further cleanups. Reviewed-by: Qu Wenruo <wqu@suse.com> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
|
|
|
2f1d3e4b93 |
btrfs: remove btree_readpage
There is no way for this function to be called as ->readpage() since it's called from generic_file_buffered_read/filemap_fault/do_read_cache_page/readhead code. BTRFS doesn't utilize the first 3 for the btree inode and implements it's owon readhead mechanism. So simply remove the function. Reviewed-by: Qu Wenruo <wqu@suse.com> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
|
|
|
bb56f02f26 |
btrfs: reschedule if necessary when logging directory items
Logging directories with many entries can take a significant amount of time, and in some cases monopolize a cpu/core for a long time if the logging task doesn't happen to block often enough. Johannes and Lu Fengqi reported test case generic/041 triggering a soft lockup when the kernel has CONFIG_SOFTLOCKUP_DETECTOR=y. For this test case we log an inode with 3002 hard links, and because the test removed one hard link before fsyncing the file, the inode logging causes the parent directory do be logged as well, which has 6004 directory items to log (3002 BTRFS_DIR_ITEM_KEY items plus 3002 BTRFS_DIR_INDEX_KEY items), so it can take a significant amount of time and trigger the soft lockup. So just make tree-log.c:log_dir_items() reschedule when necessary, releasing the current search path before doing so and then resume from where it was before the reschedule. The stack trace produced when the soft lockup happens is the following: [10480.277653] watchdog: BUG: soft lockup - CPU#2 stuck for 22s! [xfs_io:28172] [10480.279418] Modules linked in: dm_thin_pool dm_persistent_data (...) [10480.284915] irq event stamp: 29646366 [10480.285987] hardirqs last enabled at (29646365): [<ffffffff85249b66>] __slab_alloc.constprop.0+0x56/0x60 [10480.288482] hardirqs last disabled at (29646366): [<ffffffff8579b00d>] irqentry_enter+0x1d/0x50 [10480.290856] softirqs last enabled at (4612): [<ffffffff85a00323>] __do_softirq+0x323/0x56c [10480.293615] softirqs last disabled at (4483): [<ffffffff85800dbf>] asm_call_on_stack+0xf/0x20 [10480.296428] CPU: 2 PID: 28172 Comm: xfs_io Not tainted 5.9.0-rc4-default+ #1248 [10480.298948] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.12.0-59-gc9ba527-rebuilt.opensuse.org 04/01/2014 [10480.302455] RIP: 0010:__slab_alloc.constprop.0+0x19/0x60 [10480.304151] Code: 86 e8 31 75 21 00 66 66 2e 0f 1f 84 00 00 00 (...) [10480.309558] RSP: 0018:ffffadbe09397a58 EFLAGS: 00000282 [10480.311179] RAX: ffff8a495ab92840 RBX: 0000000000000282 RCX: 0000000000000006 [10480.313242] RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffffffff85249b66 [10480.315260] RBP: ffff8a497d04b740 R08: 0000000000000001 R09: 0000000000000001 [10480.317229] R10: ffff8a497d044800 R11: ffff8a495ab93c40 R12: 0000000000000000 [10480.319169] R13: 0000000000000000 R14: 0000000000000c40 R15: ffffffffc01daf70 [10480.321104] FS: 00007fa1dc5c0e40(0000) GS:ffff8a497da00000(0000) knlGS:0000000000000000 [10480.323559] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [10480.325235] CR2: 00007fa1dc5befb8 CR3: 0000000004f8a006 CR4: 0000000000170ea0 [10480.327259] Call Trace: [10480.328286] ? overwrite_item+0x1f0/0x5a0 [btrfs] [10480.329784] __kmalloc+0x831/0xa20 [10480.331009] ? btrfs_get_32+0xb0/0x1d0 [btrfs] [10480.332464] overwrite_item+0x1f0/0x5a0 [btrfs] [10480.333948] log_dir_items+0x2ee/0x570 [btrfs] [10480.335413] log_directory_changes+0x82/0xd0 [btrfs] [10480.336926] btrfs_log_inode+0xc9b/0xda0 [btrfs] [10480.338374] ? init_once+0x20/0x20 [btrfs] [10480.339711] btrfs_log_inode_parent+0x8d3/0xd10 [btrfs] [10480.341257] ? dget_parent+0x97/0x2e0 [10480.342480] btrfs_log_dentry_safe+0x3a/0x50 [btrfs] [10480.343977] btrfs_sync_file+0x24b/0x5e0 [btrfs] [10480.345381] do_fsync+0x38/0x70 [10480.346483] __x64_sys_fsync+0x10/0x20 [10480.347703] do_syscall_64+0x2d/0x70 [10480.348891] entry_SYSCALL_64_after_hwframe+0x44/0xa9 [10480.350444] RIP: 0033:0x7fa1dc80970b [10480.351642] Code: 0f 05 48 3d 00 f0 ff ff 77 45 c3 0f 1f 40 00 48 (...) [10480.356952] RSP: 002b:00007fffb3d081d0 EFLAGS: 00000293 ORIG_RAX: 000000000000004a [10480.359458] RAX: ffffffffffffffda RBX: 0000562d93d45e40 RCX: 00007fa1dc80970b [10480.361426] RDX: 0000562d93d44ab0 RSI: 0000562d93d45e60 RDI: 0000000000000003 [10480.363367] RBP: 0000000000000001 R08: 0000000000000000 R09: 00007fa1dc7b2a40 [10480.365317] R10: 0000562d93d0e366 R11: 0000000000000293 R12: 0000000000000001 [10480.367299] R13: 0000562d93d45290 R14: 0000562d93d45e40 R15: 0000562d93d45e60 Link: https://lore.kernel.org/linux-btrfs/20180713090216.GC575@fnst.localdomain/ Reported-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> CC: stable@vger.kernel.org # 4.4+ Tested-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
|
|
|
49ea112da0 |
btrfs: do not create raid sysfs entries under any locks
While running xfstests btrfs/177 I got the following lockdep splat ====================================================== WARNING: possible circular locking dependency detected 5.9.0-rc3+ #5 Not tainted ------------------------------------------------------ kswapd0/100 is trying to acquire lock: ffff97066aa56760 (&delayed_node->mutex){+.+.}-{3:3}, at: __btrfs_release_delayed_node.part.0+0x3f/0x330 but task is already holding lock: ffffffff9fd74700 (fs_reclaim){+.+.}-{0:0}, at: __fs_reclaim_acquire+0x5/0x30 which lock already depends on the new lock. the existing dependency chain (in reverse order) is: -> #3 (fs_reclaim){+.+.}-{0:0}: fs_reclaim_acquire+0x65/0x80 slab_pre_alloc_hook.constprop.0+0x20/0x200 kmem_cache_alloc+0x37/0x270 alloc_inode+0x82/0xb0 iget_locked+0x10d/0x2c0 kernfs_get_inode+0x1b/0x130 kernfs_get_tree+0x136/0x240 sysfs_get_tree+0x16/0x40 vfs_get_tree+0x28/0xc0 path_mount+0x434/0xc00 __x64_sys_mount+0xe3/0x120 do_syscall_64+0x33/0x40 entry_SYSCALL_64_after_hwframe+0x44/0xa9 -> #2 (kernfs_mutex){+.+.}-{3:3}: __mutex_lock+0x7e/0x7e0 kernfs_add_one+0x23/0x150 kernfs_create_dir_ns+0x7a/0xb0 sysfs_create_dir_ns+0x60/0xb0 kobject_add_internal+0xc0/0x2c0 kobject_add+0x6e/0x90 btrfs_sysfs_add_block_group_type+0x102/0x160 btrfs_make_block_group+0x167/0x230 btrfs_alloc_chunk+0x54f/0xb80 btrfs_chunk_alloc+0x18e/0x3a0 find_free_extent+0xdf6/0x1210 btrfs_reserve_extent+0xb3/0x1b0 btrfs_alloc_tree_block+0xb0/0x310 alloc_tree_block_no_bg_flush+0x4a/0x60 __btrfs_cow_block+0x11a/0x530 btrfs_cow_block+0x104/0x220 btrfs_search_slot+0x52e/0x9d0 btrfs_insert_empty_items+0x64/0xb0 btrfs_new_inode+0x225/0x730 btrfs_create+0xab/0x1f0 lookup_open.isra.0+0x52d/0x690 path_openat+0x2a7/0x9e0 do_filp_open+0x75/0x100 do_sys_openat2+0x7b/0x130 __x64_sys_openat+0x46/0x70 do_syscall_64+0x33/0x40 entry_SYSCALL_64_after_hwframe+0x44/0xa9 -> #1 (&fs_info->chunk_mutex){+.+.}-{3:3}: __mutex_lock+0x7e/0x7e0 btrfs_chunk_alloc+0x125/0x3a0 find_free_extent+0xdf6/0x1210 btrfs_reserve_extent+0xb3/0x1b0 btrfs_alloc_tree_block+0xb0/0x310 alloc_tree_block_no_bg_flush+0x4a/0x60 __btrfs_cow_block+0x11a/0x530 btrfs_cow_block+0x104/0x220 btrfs_search_slot+0x52e/0x9d0 btrfs_lookup_inode+0x2a/0x8f __btrfs_update_delayed_inode+0x80/0x240 btrfs_commit_inode_delayed_inode+0x119/0x120 btrfs_evict_inode+0x357/0x500 evict+0xcf/0x1f0 do_unlinkat+0x1a9/0x2b0 do_syscall_64+0x33/0x40 entry_SYSCALL_64_after_hwframe+0x44/0xa9 -> #0 (&delayed_node->mutex){+.+.}-{3:3}: __lock_acquire+0x119c/0x1fc0 lock_acquire+0xa7/0x3d0 __mutex_lock+0x7e/0x7e0 __btrfs_release_delayed_node.part.0+0x3f/0x330 btrfs_evict_inode+0x24c/0x500 evict+0xcf/0x1f0 dispose_list+0x48/0x70 prune_icache_sb+0x44/0x50 super_cache_scan+0x161/0x1e0 do_shrink_slab+0x178/0x3c0 shrink_slab+0x17c/0x290 shrink_node+0x2b2/0x6d0 balance_pgdat+0x30a/0x670 kswapd+0x213/0x4c0 kthread+0x138/0x160 ret_from_fork+0x1f/0x30 other info that might help us debug this: Chain exists of: &delayed_node->mutex --> kernfs_mutex --> fs_reclaim Possible unsafe locking scenario: CPU0 CPU1 ---- ---- lock(fs_reclaim); lock(kernfs_mutex); lock(fs_reclaim); lock(&delayed_node->mutex); *** DEADLOCK *** 3 locks held by kswapd0/100: #0: ffffffff9fd74700 (fs_reclaim){+.+.}-{0:0}, at: __fs_reclaim_acquire+0x5/0x30 #1: ffffffff9fd65c50 (shrinker_rwsem){++++}-{3:3}, at: shrink_slab+0x115/0x290 #2: ffff9706629780e0 (&type->s_umount_key#36){++++}-{3:3}, at: super_cache_scan+0x38/0x1e0 stack backtrace: CPU: 1 PID: 100 Comm: kswapd0 Not tainted 5.9.0-rc3+ #5 Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.13.0-2.fc32 04/01/2014 Call Trace: dump_stack+0x8b/0xb8 check_noncircular+0x12d/0x150 __lock_acquire+0x119c/0x1fc0 lock_acquire+0xa7/0x3d0 ? __btrfs_release_delayed_node.part.0+0x3f/0x330 __mutex_lock+0x7e/0x7e0 ? __btrfs_release_delayed_node.part.0+0x3f/0x330 ? __btrfs_release_delayed_node.part.0+0x3f/0x330 ? lock_acquire+0xa7/0x3d0 ? find_held_lock+0x2b/0x80 __btrfs_release_delayed_node.part.0+0x3f/0x330 btrfs_evict_inode+0x24c/0x500 evict+0xcf/0x1f0 dispose_list+0x48/0x70 prune_icache_sb+0x44/0x50 super_cache_scan+0x161/0x1e0 do_shrink_slab+0x178/0x3c0 shrink_slab+0x17c/0x290 shrink_node+0x2b2/0x6d0 balance_pgdat+0x30a/0x670 kswapd+0x213/0x4c0 ? _raw_spin_unlock_irqrestore+0x41/0x50 ? add_wait_queue_exclusive+0x70/0x70 ? balance_pgdat+0x670/0x670 kthread+0x138/0x160 ? kthread_create_worker_on_cpu+0x40/0x40 ret_from_fork+0x1f/0x30 This happens because when we link in a block group with a new raid index type we'll create the corresponding sysfs entries for it. This is problematic because while restriping we're holding the chunk_mutex, and while mounting we're holding the tree locks. Fixing this isn't pretty, we move the call to the sysfs stuff into the btrfs_create_pending_block_groups() work, where we're not holding any locks. This creates a slight race where other threads could see that there's no sysfs kobj for that raid type, and race to create the sysfs dir. Fix this by wrapping the creation in space_info->lock, so we only get one thread calling kobject_add() for the new directory. We don't worry about the lock on cleanup as it only gets deleted on unmount. On mount it's more straightforward, we loop through the space_infos already, just check every raid index in each space_info and added the sysfs entries for the corresponding block groups. Signed-off-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: David Sterba <dsterba@suse.com> |
|
|
|
7280490500 |
btrfs: kill the RCU protection for fs_info->space_info
We have this thing wrapped in an RCU lock, but it's really not needed. We create all the space_info's on mount, and we destroy them on unmount. The list never changes and we're protected from messing with it by the normal mount/umount path, so kill the RCU stuff around it. Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
|
|
|
7269ddd2f6 |
btrfs: improve error message in setup_items_for_insert
Reword and update formats to match variable types. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> [ update formats ] Signed-off-by: David Sterba <dsterba@suse.com> |
|
|
|
da9ffb242c |
btrfs: add kerneldoc for setup_items_for_insert
Reviewed-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
|
|
|
fc0d82e103 |
btrfs: sink total_data parameter in setup_items_for_insert
That parameter can easily be derived based on the "data_size" and "nr" parameters exploit this fact to simply the function's signature. No functional changes. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
|
|
|
3dc9dc8969 |
btrfs: eliminate total_size parameter from setup_items_for_insert
The value of this argument can be derived from the total_data as it's simply the value of the data size + size of btrfs_items being touched. Move the parameter calculation inside the function. This results in a simpler interface and also a minor size reduction: ./scripts/bloat-o-meter ctree.original fs/btrfs/ctree.o add/remove: 0/0 grow/shrink: 0/3 up/down: 0/-34 (-34) Function old new delta btrfs_duplicate_item 260 259 -1 setup_items_for_insert 1200 1190 -10 btrfs_insert_empty_items 177 154 -23 Reviewed-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
|
|
|
fc0716c2f6 |
btrfs: re-arrange statements in setup_items_for_insert
Rearrange statements calculating the offset of the newly added items so that the calculation has to be done only once. No functional change. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
|
|
|
7573df5547 |
btrfs: sysfs: export supported send stream version
This reports the latest send stream version supported by the kernel as the feature in /sys/fs/btrfs/features/send_stream_version . Signed-off-by: Omar Sandoval <osandov@fb.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
|
|
|
c9a949af13 |
btrfs: send: use btrfs_file_extent_end() in send_write_or_clone()
send_write_or_clone() basically has an open-coded copy of btrfs_file_extent_end() except that it (incorrectly) aligns to PAGE_SIZE instead of sectorsize. Fix and simplify the code by using btrfs_file_extent_end(). Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Omar Sandoval <osandov@fb.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
|
|
|
8c7d9fe06f |
btrfs: send: avoid copying file data
send_write() currently copies from the page cache to sctx->read_buf, and then from sctx->read_buf to sctx->send_buf. Similarly, send_hole() zeroes sctx->read_buf and then copies from sctx->read_buf to sctx->send_buf. However, if we write the TLV header manually, we can copy to sctx->send_buf directly and get rid of sctx->read_buf. Reviewed-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Omar Sandoval <osandov@fb.com> Signed-off-by: David Sterba <dsterba@suse.com> |
|
|
|
a9b2e0de92 |
btrfs: send: get rid of i_size logic in send_write()
send_write()/fill_read_buf() have some logic for avoiding reading past i_size. However, everywhere that we call send_write()/send_extent_data(), we've already clamped the length down to i_size. Get rid of the i_size handling, which simplifies the next change. Reviewed-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Omar Sandoval <osandov@fb.com> Signed-off-by: David Sterba <dsterba@suse.com> |
|
|
|
0cbb5bdfea |
btrfs: rename btrfs_insert_clone_extent() to a more generic name
Now that we use the same mechanism to replace all the extents in a file range with either a hole, an existing extent (when cloning) or a new extent (when using fallocate), the name of btrfs_insert_clone_extent() no longer reflects its genericity. So rename it to btrfs_insert_replace_extent(), since what it does is to either insert an existing extent or a new extent into a file range. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
|
|
|
306bfec02b |
btrfs: rename btrfs_punch_hole_range() to a more generic name
The function btrfs_punch_hole_range() is now used to replace all the file extents in a given file range with an extent described in the given struct btrfs_replace_extent_info argument. This extent can either be an existing extent that is being cloned or it can be a new extent (namely a prealloc extent). When that argument is NULL it only punches a hole (drops all the existing extents) in the file range. So rename the function to btrfs_replace_file_extents(). Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
|
|
|
bf385648fa |
btrfs: rename struct btrfs_clone_extent_info to a more generic name
Now that we can use btrfs_clone_extent_info to convey information for a new prealloc extent as well, and not just for existing extents that are being cloned, rename it to btrfs_replace_extent_info, which reflects the fact that this is now more generic and it is used to replace all existing extents in a file range with the extent described by the structure. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
|
|
|
fb870f6cdd |
btrfs: remove item_size member of struct btrfs_clone_extent_info
The value of item_size of struct btrfs_clone_extent_info is always set to the size of a non-inline file extent item, and in fact the infrastructure that uses this structure (btrfs_punch_hole_range()) does not work with inline file extents at all (and it is not supposed to). So just remove that field from the structure and use directly sizeof(struct btrfs_file_extent_item) instead. Also assert that the file extent type is not inline at btrfs_insert_clone_extent(). Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
|
|
|
8fccebfa53 |
btrfs: fix metadata reservation for fallocate that leads to transaction aborts
When doing an fallocate(), specially a zero range operation, we assume that reserving 3 units of metadata space is enough, that at most we touch one leaf in subvolume/fs tree for removing existing file extent items and inserting a new file extent item. This assumption is generally true for most common use cases. However when we end up needing to remove file extent items from multiple leaves, we can end up failing with -ENOSPC and abort the current transaction, turning the filesystem to RO mode. When this happens a stack trace like the following is dumped in dmesg/syslog: [ 1500.620934] ------------[ cut here ]------------ [ 1500.620938] BTRFS: Transaction aborted (error -28) [ 1500.620973] WARNING: CPU: 2 PID: 30807 at fs/btrfs/inode.c:9724 __btrfs_prealloc_file_range+0x512/0x570 [btrfs] [ 1500.620974] Modules linked in: btrfs intel_rapl_msr intel_rapl_common kvm_intel (...) [ 1500.621010] CPU: 2 PID: 30807 Comm: xfs_io Tainted: G W 5.9.0-rc3-btrfs-next-67 #1 [ 1500.621012] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.13.0-0-gf21b5a4aeb02-prebuilt.qemu.org 04/01/2014 [ 1500.621023] RIP: 0010:__btrfs_prealloc_file_range+0x512/0x570 [btrfs] [ 1500.621026] Code: 8b 40 50 f0 48 (...) [ 1500.621028] RSP: 0018:ffffb05fc8803ca0 EFLAGS: 00010286 [ 1500.621030] RAX: 0000000000000000 RBX: ffff9608af276488 RCX: 0000000000000000 [ 1500.621032] RDX: 0000000000000001 RSI: 0000000000000027 RDI: 00000000ffffffff [ 1500.621033] RBP: ffffb05fc8803d90 R08: 0000000000000001 R09: 0000000000000001 [ 1500.621035] R10: 0000000000000000 R11: 0000000000000000 R12: 0000000003200000 [ 1500.621037] R13: 00000000ffffffe4 R14: ffff9608af275fe8 R15: ffff9608af275f60 [ 1500.621039] FS: 00007fb5b2368ec0(0000) GS:ffff9608b6600000(0000) knlGS:0000000000000000 [ 1500.621041] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 1500.621043] CR2: 00007fb5b2366fb8 CR3: 0000000202d38005 CR4: 00000000003706e0 [ 1500.621046] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 [ 1500.621047] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 [ 1500.621049] Call Trace: [ 1500.621076] btrfs_prealloc_file_range+0x10/0x20 [btrfs] [ 1500.621087] btrfs_fallocate+0xccd/0x1280 [btrfs] [ 1500.621108] vfs_fallocate+0x14d/0x290 [ 1500.621112] ksys_fallocate+0x3a/0x70 [ 1500.621117] __x64_sys_fallocate+0x1a/0x20 [ 1500.621120] do_syscall_64+0x33/0x80 [ 1500.621123] entry_SYSCALL_64_after_hwframe+0x44/0xa9 [ 1500.621126] RIP: 0033:0x7fb5b248c477 [ 1500.621128] Code: 89 7c 24 08 (...) [ 1500.621130] RSP: 002b:00007ffc7bee9060 EFLAGS: 00000293 ORIG_RAX: 000000000000011d [ 1500.621132] RAX: ffffffffffffffda RBX: 0000000000000002 RCX: 00007fb5b248c477 [ 1500.621134] RDX: 0000000000000000 RSI: 0000000000000010 RDI: 0000000000000003 [ 1500.621136] RBP: 0000557718faafd0 R08: 0000000000000000 R09: 0000000000000000 [ 1500.621137] R10: 0000000003200000 R11: 0000000000000293 R12: 0000000000000010 [ 1500.621139] R13: 0000557718faafb0 R14: 0000557718faa480 R15: 0000000000000003 [ 1500.621151] irq event stamp: 1026217 [ 1500.621154] hardirqs last enabled at (1026223): [<ffffffffba965570>] console_unlock+0x500/0x5c0 [ 1500.621156] hardirqs last disabled at (1026228): [<ffffffffba9654c7>] console_unlock+0x457/0x5c0 [ 1500.621159] softirqs last enabled at (1022486): [<ffffffffbb6003dc>] __do_softirq+0x3dc/0x606 [ 1500.621161] softirqs last disabled at (1022477): [<ffffffffbb4010b2>] asm_call_on_stack+0x12/0x20 [ 1500.621162] ---[ end trace 2955b08408d8b9d4 ]--- [ 1500.621167] BTRFS: error (device sdj) in __btrfs_prealloc_file_range:9724: errno=-28 No space left When we use fallocate() internally, for reserving an extent for a space cache, inode cache or relocation, we can't hit this problem since either there aren't any file extent items to remove from the subvolume tree or there is at most one. When using plain fallocate() it's very unlikely, since that would require having many file extent items representing holes for the target range and crossing multiple leafs - we attempt to increase the range (merge) of such file extent items when punching holes, so at most we end up with 2 file extent items for holes at leaf boundaries. However when using the zero range operation of fallocate() for a large range (100+ MiB for example) that's fairly easy to trigger. The following example reproducer triggers the issue: $ cat reproducer.sh #!/bin/bash umount /dev/sdj &> /dev/null mkfs.btrfs -f -n 16384 -O ^no-holes /dev/sdj > /dev/null mount /dev/sdj /mnt/sdj # Create a 100M file with many file extent items. Punch a hole every 8K # just to speedup the file creation - we could do 4K sequential writes # followed by fsync (or O_SYNC) as well, but that takes a lot of time. file_size=$((100 * 1024 * 1024)) xfs_io -f -c "pwrite -S 0xab -b 10M 0 $file_size" /mnt/sdj/foobar for ((i = 0; i < $file_size; i += 8192)); do xfs_io -c "fpunch $i 4096" /mnt/sdj/foobar done # Force a transaction commit, so the zero range operation will be forced # to COW all metadata extents it need to touch. sync xfs_io -c "fzero 0 $file_size" /mnt/sdj/foobar umount /mnt/sdj $ ./reproducer.sh wrote 104857600/104857600 bytes at offset 0 100 MiB, 10 ops; 0.0669 sec (1.458 GiB/sec and 149.3117 ops/sec) fallocate: No space left on device $ dmesg <shows the same stack trace pasted before> To fix this use the existing infrastructure that hole punching and extent cloning use for replacing a file range with another extent. This deals with doing the removal of file extent items and inserting the new one using an incremental approach, reserving more space when needed and always ensuring we don't leave an implicit hole in the range in case we need to do multiple iterations and a crash happens between iterations. A test case for fstests will follow up soon. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
|
|
|
a31a5876fa |
btrfs: remove unused function calc_global_rsv_need_space()
It is not used since commit
|
|
|
|
0725c0c935 |
btrfs: move btrfs_dev_replace_update_device_in_mapping_tree to drop declaration
The function is short and simple, we can get rid of the declaration as it's not necessary for a static function. Move it before its first caller. No functional changes. Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: Anand Jain <anand.jain@oracle.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
|
|
|
c83b60c0e4 |
btrfs: simplify gotos in open_seed_device
The function does not have a common exit block and returns immediatelly so there's no point having the goto. Remove the two cases. Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: Anand Jain <anand.jain@oracle.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
|
|
|
e493e8f9bc |
btrfs: remove unnecessary tmp variable in btrfs_assign_next_active_device()
We can check the argument value directly, no need for the temporary variable. Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: Anand Jain <anand.jain@oracle.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
|
|
|
1888709d71 |
btrfs: remove tmp variable for list traversal in btrfs_init_dev_replace_tgtdev
In the function btrfs_init_dev_replace_tgtdev(), the local variable devices is used only once, we can remove it. Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: Anand Jain <anand.jain@oracle.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
|
|
|
e17125b52b |
btrfs: use sprout device_list_mutex in btrfs_init_devices_late
On a mounted sprout filesystem, all threads now are using the sprout::device_list_mutex, and this is the only code using the seed::device_list_mutex. This patch converts to use the sprouts fs_info->fs_devices->device_list_mutex. The same reasoning holds true here, that device delete is holding the sprout::device_list_mutex. Signed-off-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com> |
|
|
|
2fca0db076 |
btrfs: reada: lock all seed/sprout devices in __reada_start_machine
On an fs mounted using a sprout device, the seed fs_devices are maintained in a linked list under fs_info->fs_devices. Each seeds fs_devices also has device_list_mutex initialized to protect against the potential race with delete threads. But the delete thread (at btrfs_rm_device()) is holding the fs_info::fs_devices::device_list_mutex mutex which belongs to sprout device_list_mutex instead of seed device_list_mutex. Moreover, there aren't any significient benefits in using the seed::device_list_mutex instead of sprout::device_list_mutex. So this patch converts them of using the seed::device_list_mutex to sprout::device_list_mutex. Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com> |
|
|
|
7ad3912a70 |
btrfs: handle errors in btrfs_sysfs_add_fs_devices
btrfs_sysfs_add_fs_devices() is called by btrfs_sysfs_add_mounted(). btrfs_sysfs_add_mounted() assumes that btrfs_sysfs_add_fs_devices() will either add sysfs entries for all the devices or none. So this patch keeps up to its caller expecatation and cleans up the created sysfs entries if it has to fail at some device in the list. Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com> |
|
|
|
30b0e4e0e3 |
btrfs: initialize sysfs devid and device link for seed device
We don't initialize the sysfs devid kobject and device-link yet for the seed devices in an sprouted filesystem. So this patch initializes the seed device devid kobject and the device link in the sysfs. Signed-off-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com> |
|
|
|
53f8a74cbe |
btrfs: split and refactor btrfs_sysfs_remove_devices_dir
Similar to btrfs_sysfs_add_devices_dir()'s refactoring, split btrfs_sysfs_remove_devices_dir() so that we don't have to use the device argument to indicate whether to free all devices or just one device. Export btrfs_sysfs_remove_device() as device operations outside of sysfs.c now calls this instead of btrfs_sysfs_remove_devices_dir(). btrfs_sysfs_remove_devices_dir() is renamed to btrfs_sysfs_remove_fs_devices() to suite its new role. Now, no one outside of sysfs.c calls btrfs_sysfs_remove_fs_devices() so it is redeclared s static. And the same function had to be moved before its first caller. Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: Anand Jain <anand.jain@oracle.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
|
|
|
cd36da2e7e |
btrfs: simplify parameters of btrfs_sysfs_add_devices_dir
When we add a device we need to add it to sysfs, so instead of using the btrfs_sysfs_add_devices_dir() fs_devices argument to specify whether to add a device or all of fs_devices, call the helper function directly btrfs_sysfs_add_device() and thus make it non-static. Reviewed-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Anand Jain <anand.jain@oracle.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
|
|
|
6a416a018f |
btrfs: make btrfs_sysfs_remove_devices_dir return void
btrfs_sysfs_remove_devices_dir() return value is unused declare it as void. Reviewed-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Anand Jain <anand.jain@oracle.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
|
|
|
985e233e96 |
btrfs: add btrfs_sysfs_remove_device helper
btrfs_sysfs_remove_devices_dir() removes device link and devid kobject (sysfs entries) for a device or all the devices in the btrfs_fs_devices. In preparation to remove these sysfs entries for the seed as well, add a btrfs_sysfs_remove_device() helper function and avoid code duplication. Reviewed-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Anand Jain <anand.jain@oracle.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
|
|
|
178a16c940 |
btrfs: add btrfs_sysfs_add_device helper
btrfs_sysfs_add_devices_dir() adds device link and devid kobject (sysfs entries) for a device or all the devices in the btrfs_fs_devices. In preparation to add these sysfs entries for the seed as well, add a btrfs_sysfs_add_device() helper function and avoid code duplication. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Anand Jain <anand.jain@oracle.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
|
|
|
c6a5d95495 |
btrfs: fix replace of seed device
If you replace a seed device in a sprouted fs, it appears to have successfully replaced the seed device, but if you look closely, it didn't. Here is an example. $ mkfs.btrfs /dev/sda $ btrfstune -S1 /dev/sda $ mount /dev/sda /btrfs $ btrfs device add /dev/sdb /btrfs $ umount /btrfs $ btrfs device scan --forget $ mount -o device=/dev/sda /dev/sdb /btrfs $ btrfs replace start -f /dev/sda /dev/sdc /btrfs $ echo $? 0 BTRFS info (device sdb): dev_replace from /dev/sda (devid 1) to /dev/sdc started BTRFS info (device sdb): dev_replace from /dev/sda (devid 1) to /dev/sdc finished $ btrfs fi show Label: none uuid: ab2c88b7-be81-4a7e-9849-c3666e7f9f4f Total devices 2 FS bytes used 256.00KiB devid 1 size 3.00GiB used 520.00MiB path /dev/sdc devid 2 size 3.00GiB used 896.00MiB path /dev/sdb Label: none uuid: 10bd3202-0415-43af-96a8-d5409f310a7e Total devices 1 FS bytes used 128.00KiB devid 1 size 3.00GiB used 536.00MiB path /dev/sda So as per the replace start command and kernel log replace was successful. Now let's try to clean mount. $ umount /btrfs $ btrfs device scan --forget $ mount -o device=/dev/sdc /dev/sdb /btrfs mount: /btrfs: wrong fs type, bad option, bad superblock on /dev/sdb, missing codepage or helper program, or other error. [ 636.157517] BTRFS error (device sdc): failed to read chunk tree: -2 [ 636.180177] BTRFS error (device sdc): open_ctree failed That's because per dev items it is still looking for the original seed device. $ btrfs inspect-internal dump-tree -d /dev/sdb item 0 key (DEV_ITEMS DEV_ITEM 1) itemoff 16185 itemsize 98 devid 1 total_bytes 3221225472 bytes_used 545259520 io_align 4096 io_width 4096 sector_size 4096 type 0 generation 6 start_offset 0 dev_group 0 seek_speed 0 bandwidth 0 uuid 59368f50-9af2-4b17-91da-8a783cc418d4 <--- seed uuid fsid 10bd3202-0415-43af-96a8-d5409f310a7e <--- seed fsid item 1 key (DEV_ITEMS DEV_ITEM 2) itemoff 16087 itemsize 98 devid 2 total_bytes 3221225472 bytes_used 939524096 io_align 4096 io_width 4096 sector_size 4096 type 0 generation 0 start_offset 0 dev_group 0 seek_speed 0 bandwidth 0 uuid 56a0a6bc-4630-4998-8daf-3c3030c4256a <- sprout uuid fsid ab2c88b7-be81-4a7e-9849-c3666e7f9f4f <- sprout fsid But the replaced target has the following uuid+fsid in its superblock which doesn't match with the expected uuid+fsid in its devitem. $ btrfs in dump-super /dev/sdc | egrep '^generation|dev_item.uuid|dev_item.fsid|devid' generation 20 dev_item.uuid 59368f50-9af2-4b17-91da-8a783cc418d4 dev_item.fsid ab2c88b7-be81-4a7e-9849-c3666e7f9f4f [match] dev_item.devid 1 So if you provide the original seed device the mount shall be successful. Which so long happening in the test case btrfs/163. $ btrfs device scan --forget $ mount -o device=/dev/sda /dev/sdb /btrfs Fix in this patch: If a seed is not sprouted then there is no replacement of it, because of its read-only filesystem with a read-only device. Similarly, in the case of a sprouted filesystem, the seed device is still read only. So, mark it as you can't replace a seed device, you can only add a new device and then delete the seed device. If replace is attempted then returns -EINVAL. Signed-off-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com> |
|
|
|
79dae17d8d |
btrfs: improve device scanning messages
Systems booting without the initramfs seems to scan an unusual kind of device path (/dev/root). And at a later time, the device is updated to the correct path. We generally print the process name and PID of the process scanning the device but we don't capture the same information if the device path is rescanned with a different pathname. The current message is too long, so drop the unnecessary UUID and add process name and PID. While at this also update the duplicate device warning to include the process name and PID so the messages are consistent CC: stable@vger.kernel.org # 4.19+ Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=89721 Signed-off-by: Anand Jain <anand.jain@oracle.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
|
|
|
457f1864b5 |
btrfs: pretty print leaked root name
I'm a actual human being so am incapable of converting u64 to s64 in my head, so add a helper to get the pretty name of a root objectid and use that helper to spit out the name for any special roots for leaked roots, so I don't have to scratch my head and figure out which root I messed up the refs for. Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
|
|
|
66a2823c54 |
btrfs: sysfs: export currently running exclusive operation
/sys/fs/<fsid>/exclusive_operation contains the currently executing exclusive operation. Add a sysfs_notify() when operation end, so userspace can be notified of exclusive operation is finished. Reviewed-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
|
|
|
c3e1f96c37 |
btrfs: enumerate the type of exclusive operation in progress
Instead of using a flag bit for exclusive operation, use a variable to store which exclusive operation is being performed. Introduce an API to start and finish an exclusive operation. This would enable another way for tools to check which operation is running on why starting an exclusive operation failed. The followup patch adds a sysfs_notify() to alert userspace when the state changes, so userspace can perform select() on it to get notified of the change. This would enable us to enqueue a command which will wait for current exclusive operation to complete before issuing the next exclusive operation. This has been done synchronously as opposed to a background process, or else error collection (if any) will become difficult. Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> [ update comments ] Signed-off-by: David Sterba <dsterba@suse.com> |
|
|
|
ca10845a56 |
btrfs: sysfs: init devices outside of the chunk_mutex
While running btrfs/061, btrfs/073, btrfs/078, or btrfs/178 we hit the following lockdep splat: ====================================================== WARNING: possible circular locking dependency detected 5.9.0-rc3+ #4 Not tainted ------------------------------------------------------ kswapd0/100 is trying to acquire lock: ffff96ecc22ef4a0 (&delayed_node->mutex){+.+.}-{3:3}, at: __btrfs_release_delayed_node.part.0+0x3f/0x330 but task is already holding lock: ffffffff8dd74700 (fs_reclaim){+.+.}-{0:0}, at: __fs_reclaim_acquire+0x5/0x30 which lock already depends on the new lock. the existing dependency chain (in reverse order) is: -> #3 (fs_reclaim){+.+.}-{0:0}: fs_reclaim_acquire+0x65/0x80 slab_pre_alloc_hook.constprop.0+0x20/0x200 kmem_cache_alloc+0x37/0x270 alloc_inode+0x82/0xb0 iget_locked+0x10d/0x2c0 kernfs_get_inode+0x1b/0x130 kernfs_get_tree+0x136/0x240 sysfs_get_tree+0x16/0x40 vfs_get_tree+0x28/0xc0 path_mount+0x434/0xc00 __x64_sys_mount+0xe3/0x120 do_syscall_64+0x33/0x40 entry_SYSCALL_64_after_hwframe+0x44/0xa9 -> #2 (kernfs_mutex){+.+.}-{3:3}: __mutex_lock+0x7e/0x7e0 kernfs_add_one+0x23/0x150 kernfs_create_link+0x63/0xa0 sysfs_do_create_link_sd+0x5e/0xd0 btrfs_sysfs_add_devices_dir+0x81/0x130 btrfs_init_new_device+0x67f/0x1250 btrfs_ioctl+0x1ef/0x2e20 __x64_sys_ioctl+0x83/0xb0 do_syscall_64+0x33/0x40 entry_SYSCALL_64_after_hwframe+0x44/0xa9 -> #1 (&fs_info->chunk_mutex){+.+.}-{3:3}: __mutex_lock+0x7e/0x7e0 btrfs_chunk_alloc+0x125/0x3a0 find_free_extent+0xdf6/0x1210 btrfs_reserve_extent+0xb3/0x1b0 btrfs_alloc_tree_block+0xb0/0x310 alloc_tree_block_no_bg_flush+0x4a/0x60 __btrfs_cow_block+0x11a/0x530 btrfs_cow_block+0x104/0x220 btrfs_search_slot+0x52e/0x9d0 btrfs_insert_empty_items+0x64/0xb0 btrfs_insert_delayed_items+0x90/0x4f0 btrfs_commit_inode_delayed_items+0x93/0x140 btrfs_log_inode+0x5de/0x2020 btrfs_log_inode_parent+0x429/0xc90 btrfs_log_new_name+0x95/0x9b btrfs_rename2+0xbb9/0x1800 vfs_rename+0x64f/0x9f0 do_renameat2+0x320/0x4e0 __x64_sys_rename+0x1f/0x30 do_syscall_64+0x33/0x40 entry_SYSCALL_64_after_hwframe+0x44/0xa9 -> #0 (&delayed_node->mutex){+.+.}-{3:3}: __lock_acquire+0x119c/0x1fc0 lock_acquire+0xa7/0x3d0 __mutex_lock+0x7e/0x7e0 __btrfs_release_delayed_node.part.0+0x3f/0x330 btrfs_evict_inode+0x24c/0x500 evict+0xcf/0x1f0 dispose_list+0x48/0x70 prune_icache_sb+0x44/0x50 super_cache_scan+0x161/0x1e0 do_shrink_slab+0x178/0x3c0 shrink_slab+0x17c/0x290 shrink_node+0x2b2/0x6d0 balance_pgdat+0x30a/0x670 kswapd+0x213/0x4c0 kthread+0x138/0x160 ret_from_fork+0x1f/0x30 other info that might help us debug this: Chain exists of: &delayed_node->mutex --> kernfs_mutex --> fs_reclaim Possible unsafe locking scenario: CPU0 CPU1 ---- ---- lock(fs_reclaim); lock(kernfs_mutex); lock(fs_reclaim); lock(&delayed_node->mutex); *** DEADLOCK *** 3 locks held by kswapd0/100: #0: ffffffff8dd74700 (fs_reclaim){+.+.}-{0:0}, at: __fs_reclaim_acquire+0x5/0x30 #1: ffffffff8dd65c50 (shrinker_rwsem){++++}-{3:3}, at: shrink_slab+0x115/0x290 #2: ffff96ed2ade30e0 (&type->s_umount_key#36){++++}-{3:3}, at: super_cache_scan+0x38/0x1e0 stack backtrace: CPU: 0 PID: 100 Comm: kswapd0 Not tainted 5.9.0-rc3+ #4 Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.13.0-2.fc32 04/01/2014 Call Trace: dump_stack+0x8b/0xb8 check_noncircular+0x12d/0x150 __lock_acquire+0x119c/0x1fc0 lock_acquire+0xa7/0x3d0 ? __btrfs_release_delayed_node.part.0+0x3f/0x330 __mutex_lock+0x7e/0x7e0 ? __btrfs_release_delayed_node.part.0+0x3f/0x330 ? __btrfs_release_delayed_node.part.0+0x3f/0x330 ? lock_acquire+0xa7/0x3d0 ? find_held_lock+0x2b/0x80 __btrfs_release_delayed_node.part.0+0x3f/0x330 btrfs_evict_inode+0x24c/0x500 evict+0xcf/0x1f0 dispose_list+0x48/0x70 prune_icache_sb+0x44/0x50 super_cache_scan+0x161/0x1e0 do_shrink_slab+0x178/0x3c0 shrink_slab+0x17c/0x290 shrink_node+0x2b2/0x6d0 balance_pgdat+0x30a/0x670 kswapd+0x213/0x4c0 ? _raw_spin_unlock_irqrestore+0x41/0x50 ? add_wait_queue_exclusive+0x70/0x70 ? balance_pgdat+0x670/0x670 kthread+0x138/0x160 ? kthread_create_worker_on_cpu+0x40/0x40 ret_from_fork+0x1f/0x30 This happens because we are holding the chunk_mutex at the time of adding in a new device. However we only need to hold the device_list_mutex, as we're going to iterate over the fs_devices devices. Move the sysfs init stuff outside of the chunk_mutex to get rid of this lockdep splat. CC: stable@vger.kernel.org # 4.4.x: f3cd2c58110dad14e: btrfs: sysfs, rename device_link add/remove functions CC: stable@vger.kernel.org # 4.4.x Reported-by: David Sterba <dsterba@suse.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
|
|
|
facee0a09c |
btrfs: make extent_fiemap take btrfs_inode
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
|
|
|
948dfeb86b |
btrfs: make btrfs_zero_range_check_range_boundary take btrfs_inode
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
|
|
|
998acfe8ff |
btrfs: make copy_inline_to_page take btrfs_inode
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
|
|
|
3c5641a83a |
btrfs: make btrfs_find_ordered_sum take btrfs_inode
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
|
|
|
f1bbde8d5f |
btrfs: make get_extent_skip_holes take btrfs_inode
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
|
|
|
3347c48f27 |
btrfs: make btrfs_writepage_endio_finish_ordered btrfs_inode-centric
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
|
|
|
53ac7ead24 |
btrfs: make btrfs_invalidatepage work on btrfs_inode
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
|
|
|
6fee248d2b |
btrfs: convert btrfs_inode_sectorsize to take btrfs_inode
It's counterintuitive to have a function named btrfs_inode_xxx which takes a generic inode. Also move the function to btrfs_inode.h so that it has access to the definition of struct btrfs_inode. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
|
|
|
90c0304c63 |
btrfs: make btrfs_dec_test_ordered_pending take btrfs_inode
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
|
|
|
acbf1dd0fc |
btrfs: make ordered extent tracepoint take btrfs_inode
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
|
|
|
6d072c8e29 |
btrfs: make btrfs_lookup_first_ordered_extent take btrfs_inode
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
|
|
|
b79b724969 |
btrfs: make inode_tree_del take btrfs_inode
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
|
|
|
ca9d473a3e |
btrfs: use BTRFS_NESTED_NEW_ROOT for double splits
I've made this change separate since it requires both of the newly added NESTED flags and I didn't want to slip it into one of those changes. If we do a double split of a node we can end up doing a BTRFS_NESTED_SPLIT on level 0, which throws lockdep off because it appears as a double lock. Since we're maxed out on subclasses, use BTRFS_NESTED_NEW_ROOT if we had to do a double split. This is OK because we won't have to do a double split if we had to insert a new root, and the new root would be at a higher level anyway. Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
|
|
|
cf6f34aa3a |
btrfs: introduce BTRFS_NESTING_NEW_ROOT for adding new roots
The way we add new roots is confusing from a locking perspective for lockdep. We generally have the rule that we lock things in order from highest level to lowest, but in the case of adding a new level to the tree we actually allocate a new block for the root, which makes the locking go in reverse. A similar issue exists for snapshotting, we cow the original root for the root of a new tree, however they're at the same level. Address this by using BTRFS_NESTING_NEW_ROOT for these operations. Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
|
|
|
4dff97e690 |
btrfs: introduce BTRFS_NESTING_SPLIT for split blocks
If we are splitting a leaf/node, we could do something like the
following
lock(leaf) BTRFS_NESTING_NORMAL
lock(left) BTRFS_NESTING_LEFT + BTRFS_NESTING_COW
push from leaf -> left
reset path to point to left
split left
allocate new block, lock block BTRFS_NESTING_SPLIT
at the new block point we need to have a different nesting level,
because we have already used either BTRFS_NESTING_LEFT or
BTRFS_NESTING_RIGHT when pushing items from the original leaf into the
adjacent leaves.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
|
|
bf59a5a216 |
btrfs: introduce BTRFS_NESTING_LEFT/RIGHT_COW
For similar reasons as BTRFS_NESTING_COW, we need
BTRFS_NESTING_LEFT/RIGHT_COW. The pattern is this
lock leaf -> BTRFS_NESTING_NORMAL
cow leaf -> BTRFS_NESTING_COW
split leaf
lock left -> BTRFS_NESTING_LEFT
cow left -> BTRFS_NESTING_LEFT_COW
We need this in order to indicate to lockdep that these locks are
discrete and are being taken in a safe order.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
|
|
bf77467a93 |
btrfs: introduce BTRFS_NESTING_LEFT/BTRFS_NESTING_RIGHT
Our lockdep maps are based on rootid+level, however in some cases we
will lock adjacent blocks on the same level, namely in searching forward
or in split/balance. Because of this lockdep will complain, so we need
a separate subclass to indicate to lockdep that these are different
locks.
lock leaf -> BTRFS_NESTING_NORMAL
cow leaf -> BTRFS_NESTING_COW
split leaf
lock left -> BTRFS_NESTING_LEFT
lock right -> BTRFS_NESTING_RIGHT
The above graph illustrates the need for this new nesting subclass.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
|
|
9631e4cc1a |
btrfs: introduce BTRFS_NESTING_COW for cow'ing blocks
When we COW a block we are holding a lock on the original block, and then we lock the new COW block. Because our lockdep maps are based on root + level, this will make lockdep complain. We need a way to indicate a subclass for locking the COW'ed block, so plumb through our btrfs_lock_nesting from btrfs_cow_block down to the btrfs_init_buffer, and then introduce BTRFS_NESTING_COW to be used for cow'ing blocks. The reason I've added all this extra infrastructure is because there will be need of different nesting classes in follow up patches. Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
|
|
|
fd7ba1c120 |
btrfs: add nesting tags to the locking helpers
We will need these when we switch to an rwsem, so plumb in the infrastructure here to use later on. I violate the 80 character limit some here because it'll be cleaned up later. Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
|
|
|
51899412dd |
btrfs: introduce btrfs_path::recurse
Our current tree locking stuff allows us to recurse with read locks if we're already holding the write lock. This is necessary for the space cache inode, as we could be holding a lock on the root_tree root when we need to cache a block group, and thus need to be able to read down the root_tree to read in the inode cache. We can get away with this in our current locking, but we won't be able to with a rwsem. Handle this by purposefully annotating the places where we require recursion, so that in the future we can maybe come up with a way to avoid the recursion. In the case of the free space inode, this will be superseded by the free space tree. Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
|
|
|
329ced799b |
btrfs: rename extent_buffer::lock_nested to extent_buffer::lock_recursed
Nested locking with lockdep and everything else refers to lock hierarchy within the same lock map. This is how we indicate the same locks for different objects are ok to take in a specific order, for our use case that would be to take the lock on a leaf and then take a lock on an adjacent leaf. What ->lock_nested _actually_ refers to is if we happen to already be holding the write lock on the extent buffer and we're allowing a read lock to be taken on that extent buffer, which is recursion. Rename this so we don't get confused when we switch to a rwsem and have to start using the _nested helpers. Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
|
|
|
b9ba017fb0 |
btrfs: don't opencode sync_blockdev in btrfs_init_new_device
Instead of opencoding filemap_write_and_wait simply call syncblockdev as it makes it abundantly clear what's going on and why this is used. No semantics changes. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
|
|
|
4ae312e972 |
btrfs: remove redundant code from btrfs_free_stale_devices
Following the refactor of btrfs_free_stale_devices in |
|
|
|
44cab9ba37 |
btrfs: refactor locked condition in btrfs_init_new_device
Invert unlocked to locked and exploit the fact it can only ever be modified if we are adding a new device to a seed filesystem. This allows to simplify the check in error: label. No semantics changes. Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
|
|
|
f4cfa9bdd4 |
btrfs: use RCU for quick device check in btrfs_init_new_device
When adding a new device there's a mandatory check to see if a device is being duplicated to the filesystem it's added to. Since this is a read-only operations not necessary to take device_list_mutex and can simply make do with an rcu-readlock. Using just RCU is safe because there won't be another device add delete running in parallel as btrfs_init_new_device is called only from btrfs_ioctl_add_dev. Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
|
|
|
d16c702fe4 |
btrfs: ctree: check key order before merging tree blocks
[BUG] With a crafted image, btrfs can panic at btrfs_del_csums(): kernel BUG at fs/btrfs/ctree.c:3188! invalid opcode: 0000 [#1] SMP PTI CPU: 0 PID: 1156 Comm: btrfs-transacti Not tainted 5.0.0-rc8+ #9 RIP: 0010:btrfs_set_item_key_safe+0x16c/0x180 RSP: 0018:ffff976141257ab8 EFLAGS: 00010202 RAX: 0000000000000001 RBX: ffff898a6b890930 RCX: 0000000004b70000 RDX: 0000000000000000 RSI: ffff976141257bae RDI: ffff976141257acf RBP: ffff976141257b10 R08: 0000000000001000 R09: ffff9761412579a8 R10: 0000000000000000 R11: 0000000000000000 R12: ffff976141257abe R13: 0000000000000003 R14: ffff898a6a8be578 R15: ffff976141257bae FS: 0000000000000000(0000) GS:ffff898a77a00000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 00007f779d9cd624 CR3: 000000022b2b4006 CR4: 00000000000206f0 Call Trace: truncate_one_csum+0xac/0xf0 btrfs_del_csums+0x24f/0x3a0 __btrfs_free_extent.isra.72+0x5a7/0xbe0 __btrfs_run_delayed_refs+0x539/0x1120 btrfs_run_delayed_refs+0xdb/0x1b0 btrfs_commit_transaction+0x52/0x950 ? start_transaction+0x94/0x450 transaction_kthread+0x163/0x190 kthread+0x105/0x140 ? btrfs_cleanup_transaction+0x560/0x560 ? kthread_destroy_worker+0x50/0x50 ret_from_fork+0x35/0x40 Modules linked in: ---[ end trace 93bf9db00e6c374e ]--- [CAUSE] This crafted image has a tricky key order corruption: checksum tree key (CSUM_TREE ROOT_ITEM 0) node 29741056 level 1 items 14 free 107 generation 19 owner CSUM_TREE ... key (EXTENT_CSUM EXTENT_CSUM 73785344) block 29757440 gen 19 key (EXTENT_CSUM EXTENT_CSUM 77594624) block 29753344 gen 19 ... leaf 29757440 items 5 free space 150 generation 19 owner CSUM_TREE item 0 key (EXTENT_CSUM EXTENT_CSUM 73785344) itemoff 2323 itemsize 1672 range start 73785344 end 75497472 length 1712128 item 1 key (EXTENT_CSUM EXTENT_CSUM 75497472) itemoff 2319 itemsize 4 range start 75497472 end 75501568 length 4096 item 2 key (EXTENT_CSUM EXTENT_CSUM 75501568) itemoff 579 itemsize 1740 range start 75501568 end 77283328 length 1781760 item 3 key (EXTENT_CSUM EXTENT_CSUM 77283328) itemoff 575 itemsize 4 range start 77283328 end 77287424 length 4096 item 4 key (EXTENT_CSUM EXTENT_CSUM 4120596480) itemoff 275 itemsize 300 <<< range start 4120596480 end 4120903680 length 307200 leaf 29753344 items 3 free space 1936 generation 19 owner CSUM_TREE item 0 key (18446744073457893366 EXTENT_CSUM 77594624) itemoff 2323 itemsize 1672 range start 77594624 end 79306752 length 1712128 ... Note the item 4 key of leaf 29757440, which is obviously too large, and even larger than the first key of the next leaf. However it still follows the key order in that tree block, thus tree checker is unable to detect it at read time, since tree checker can only work inside one leaf, thus such complex corruption can't be detected in advance. [FIX] The next time to detect such problem is at tree block merge time, which is in push_node_left(), balance_node_right(), push_leaf_left() or push_leaf_right(). Now we check if the key order of the right-most key of the left node is larger than the left-most key of the right node. By this we don't need to call the full tree-checker, while still keeping the key order correct as key order in each node is already checked by tree checker thus we only need to check the above two slots. Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=202833 Reviewed-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
|
|
|
07cce5cf3b |
btrfs: extent-tree: kill the BUG_ON() in insert_inline_extent_backref()
[BUG] With a crafted image, btrfs can panic at insert_inline_extent_backref(): kernel BUG at fs/btrfs/extent-tree.c:1857! invalid opcode: 0000 [#1] SMP PTI CPU: 0 PID: 1117 Comm: btrfs-transacti Not tainted 5.0.0-rc8+ #9 RIP: 0010:insert_inline_extent_backref+0xcc/0xe0 RSP: 0018:ffffac4dc1287be8 EFLAGS: 00010293 RAX: 0000000000000000 RBX: 0000000000000007 RCX: 0000000000000001 RDX: 0000000000001000 RSI: 0000000000000000 RDI: 0000000000000000 RBP: ffffac4dc1287c28 R08: ffffac4dc1287ab8 R09: ffffac4dc1287ac0 R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000000 R13: ffff8febef88a540 R14: ffff8febeaa7bc30 R15: 0000000000000000 FS: 0000000000000000(0000) GS:ffff8febf7a00000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 00007f663ace94c0 CR3: 0000000235698006 CR4: 00000000000206f0 Call Trace: ? _cond_resched+0x1a/0x50 __btrfs_inc_extent_ref.isra.64+0x7e/0x240 ? btrfs_merge_delayed_refs+0xa5/0x330 __btrfs_run_delayed_refs+0x653/0x1120 btrfs_run_delayed_refs+0xdb/0x1b0 btrfs_commit_transaction+0x52/0x950 ? start_transaction+0x94/0x450 transaction_kthread+0x163/0x190 kthread+0x105/0x140 ? btrfs_cleanup_transaction+0x560/0x560 ? kthread_destroy_worker+0x50/0x50 ret_from_fork+0x35/0x40 Modules linked in: ---[ end trace 2ad8b3de903cf825 ]--- [CAUSE] Due to extent tree corruption (still valid by itself, but bad cross ref), we can allocate an extent which is still in extent tree. The offending tree block of that case is from csum tree. The newly allocated tree block is also for csum tree. Then we will try to insert a tree block ref for the existing tree block ref. For a tree extent item, tree block can never be shared directly by the same tree twice. We have such BUG_ON() to prevent such problem, but this is not a proper error handling. [FIX] Replace that BUG_ON() with proper error message and leaf dump for debug build. Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=202829 Reviewed-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
|
|
|
1c2a07f598 |
btrfs: extent-tree: kill BUG_ON() in __btrfs_free_extent()
__btrfs_free_extent() is doing two things: 1. Reduce the refs number of an extent backref Either it's an inline extent backref (inside EXTENT/METADATA item) or a keyed extent backref (SHARED_* item). We only need to locate that backref line, either reduce the number or remove the backref line completely. 2. Update the refs count in EXTENT/METADATA_ITEM During step 1), we will try to locate the EXTENT/METADATA_ITEM without triggering another btrfs_search_slot() as fast path. Only when we fail to locate that item, we will trigger another btrfs_search_slot() to get that EXTENT/METADATA_ITEM after we updated/deleted the backref line. And we have a lot of strict checks on things like refs_to_drop against extent refs and special case checks for single ref extents. There are 7 BUG_ON()s, although they're doing correct checks, they can be triggered by crafted images. This patch improves the function: - Introduce two examples to show what __btrfs_free_extent() is doing One inline backref case and one keyed case. Should cover most cases. - Kill all BUG_ON()s with proper error message and optional leaf dump - Add comment to show the overall flow Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=202819 [ The report triggers one BUG_ON() in __btrfs_free_extent() ] Reviewed-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
|
|
|
f98b6215d7 |
btrfs: extent_io: do extra check for extent buffer read write functions
Although we have start, len check for extent buffer reader/write (e.g. read_extent_buffer()), these checks have limitations: - No overflow check Values like start = 1024 len = -1024 can still pass the basic (start + len) > eb->len check. - Checks are not consistent For read_extent_buffer() we only check (start + len) against eb->len. While for memcmp_extent_buffer() we also check start against eb->len. - Different error reporting mechanism We use WARN() in read_extent_buffer() but BUG() in memcpy_extent_buffer(). - Still modify memory if the request is obviously wrong In read_extent_buffer() even we find (start + len) > eb->len, we still call memset(dst, 0, len), which can easily cause memory access error if start + len overflows. To address above problems, this patch creates a new common function to check such access, check_eb_range(). - Add overflow check This function checks start, start + len against eb->len and overflow check. - Unified checks - Unified error reports Will call WARN() if CONFIG_BTRFS_DEBUG is configured. And also do btrfs_warn() message for non-debug build. - Exit ASAP if check fails No more possible memory corruption. - Add extra comment for @start @len used in those functions as it's sometimes confused with the logical addressing instead of a range inside the eb space Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=202817 [ Inspired by above report, the report itself is already addressed ] Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> [ use check_add_overflow ] Signed-off-by: David Sterba <dsterba@suse.com> |
|
|
|
217f5004fe |
btrfs: rework error detection in init_tree_roots
To avoid duplicating 3 lines of code the error detection logic in init_tree_roots is somewhat quirky. It first checks for the presence of any error condition, then checks for the specific condition to perform any specific actions. That's spurious because directly checking for each respective error condition and doing the necessary steps is more obvious. While at it change the -EUCLEAN to -EIO in case the extent buffer is not read correctly, this is in line with other sites which return -EIO when the eb couldn't be read. Additionally it results in smaller code and the code reads more linearly: add/remove: 0/0 grow/shrink: 0/1 up/down: 0/-95 (-95) Function old new delta open_ctree 17243 17148 -95 Total: Before=113104, After=113009, chg -0.08% Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
|
|
|
e85fde5162 |
btrfs: qgroup: fix qgroup meta rsv leak for subvolume operations
[BUG]
When quota is enabled for TEST_DEV, generic/013 sometimes fails like this:
generic/013 14s ... _check_dmesg: something found in dmesg (see xfstests-dev/results//generic/013.dmesg)
And with the following metadata leak:
BTRFS warning (device dm-3): qgroup 0/1370 has unreleased space, type 2 rsv 49152
------------[ cut here ]------------
WARNING: CPU: 2 PID: 47912 at fs/btrfs/disk-io.c:4078 close_ctree+0x1dc/0x323 [btrfs]
Call Trace:
btrfs_put_super+0x15/0x17 [btrfs]
generic_shutdown_super+0x72/0x110
kill_anon_super+0x18/0x30
btrfs_kill_super+0x17/0x30 [btrfs]
deactivate_locked_super+0x3b/0xa0
deactivate_super+0x40/0x50
cleanup_mnt+0x135/0x190
__cleanup_mnt+0x12/0x20
task_work_run+0x64/0xb0
__prepare_exit_to_usermode+0x1bc/0x1c0
__syscall_return_slowpath+0x47/0x230
do_syscall_64+0x64/0xb0
entry_SYSCALL_64_after_hwframe+0x44/0xa9
---[ end trace a6cfd45ba80e4e06 ]---
BTRFS error (device dm-3): qgroup reserved space leaked
BTRFS info (device dm-3): disk space caching is enabled
BTRFS info (device dm-3): has skinny extents
[CAUSE]
The qgroup preallocated meta rsv operations of that offending root are:
btrfs_delayed_inode_reserve_metadata: rsv_meta_prealloc root=1370 num_bytes=131072
btrfs_delayed_inode_reserve_metadata: rsv_meta_prealloc root=1370 num_bytes=131072
btrfs_subvolume_reserve_metadata: rsv_meta_prealloc root=1370 num_bytes=49152
btrfs_delayed_inode_release_metadata: convert_meta_prealloc root=1370 num_bytes=-131072
btrfs_delayed_inode_release_metadata: convert_meta_prealloc root=1370 num_bytes=-131072
It's pretty obvious that, we reserve qgroup meta rsv in
btrfs_subvolume_reserve_metadata(), but doesn't have corresponding
release/convert calls in btrfs_subvolume_release_metadata().
This leads to the leakage.
[FIX]
To fix this bug, we should follow what we're doing in
btrfs_delalloc_reserve_metadata(), where we reserve qgroup space, and
add it to block_rsv->qgroup_rsv_reserved.
And free the qgroup reserved metadata space when releasing the
block_rsv.
To do this, we need to change the btrfs_subvolume_release_metadata() to
accept btrfs_root, and record the qgroup_to_release number, and call
btrfs_qgroup_convert_reserved_meta() for it.
Fixes:
|
|
|
|
b4c5d8fdff |
btrfs: qgroup: fix wrong qgroup metadata reserve for delayed inode
For delayed inode facility, qgroup metadata is reserved for it, and
later freed.
However we're freeing more bytes than we reserved.
In btrfs_delayed_inode_reserve_metadata():
num_bytes = btrfs_calc_metadata_size(fs_info, 1);
...
ret = btrfs_qgroup_reserve_meta_prealloc(root,
fs_info->nodesize, true);
...
if (!ret) {
node->bytes_reserved = num_bytes;
But in btrfs_delayed_inode_release_metadata():
if (qgroup_free)
btrfs_qgroup_free_meta_prealloc(node->root,
node->bytes_reserved);
else
btrfs_qgroup_convert_reserved_meta(node->root,
node->bytes_reserved);
This means, we're always releasing more qgroup metadata rsv than we have
reserved.
This won't trigger selftest warning, as btrfs qgroup metadata rsv has
extra protection against cases like quota enabled half-way.
But we still need to fix this problem any way.
This patch will use the same num_bytes for qgroup metadata rsv so we
could handle it correctly.
Fixes:
|
|
|
|
425c6ed648 |
btrfs: do not hold device_list_mutex when closing devices
The following lockdep splat ====================================================== WARNING: possible circular locking dependency detected 5.8.0-rc7-00169-g87212851a027-dirty #929 Not tainted ------------------------------------------------------ fsstress/8739 is trying to acquire lock: ffff88bfd0eb0c90 (&fs_info->reloc_mutex){+.+.}-{3:3}, at: btrfs_record_root_in_trans+0x43/0x70 but task is already holding lock: ffff88bfbd16e538 (sb_pagefaults){.+.+}-{0:0}, at: btrfs_page_mkwrite+0x6a/0x4a0 which lock already depends on the new lock. the existing dependency chain (in reverse order) is: -> #10 (sb_pagefaults){.+.+}-{0:0}: __sb_start_write+0x129/0x210 btrfs_page_mkwrite+0x6a/0x4a0 do_page_mkwrite+0x4d/0xc0 handle_mm_fault+0x103c/0x1730 exc_page_fault+0x340/0x660 asm_exc_page_fault+0x1e/0x30 -> #9 (&mm->mmap_lock#2){++++}-{3:3}: __might_fault+0x68/0x90 _copy_to_user+0x1e/0x80 perf_read+0x141/0x2c0 vfs_read+0xad/0x1b0 ksys_read+0x5f/0xe0 do_syscall_64+0x50/0x90 entry_SYSCALL_64_after_hwframe+0x44/0xa9 -> #8 (&cpuctx_mutex){+.+.}-{3:3}: __mutex_lock+0x9f/0x930 perf_event_init_cpu+0x88/0x150 perf_event_init+0x1db/0x20b start_kernel+0x3ae/0x53c secondary_startup_64+0xa4/0xb0 -> #7 (pmus_lock){+.+.}-{3:3}: __mutex_lock+0x9f/0x930 perf_event_init_cpu+0x4f/0x150 cpuhp_invoke_callback+0xb1/0x900 _cpu_up.constprop.26+0x9f/0x130 cpu_up+0x7b/0xc0 bringup_nonboot_cpus+0x4f/0x60 smp_init+0x26/0x71 kernel_init_freeable+0x110/0x258 kernel_init+0xa/0x103 ret_from_fork+0x1f/0x30 -> #6 (cpu_hotplug_lock){++++}-{0:0}: cpus_read_lock+0x39/0xb0 kmem_cache_create_usercopy+0x28/0x230 kmem_cache_create+0x12/0x20 bioset_init+0x15e/0x2b0 init_bio+0xa3/0xaa do_one_initcall+0x5a/0x2e0 kernel_init_freeable+0x1f4/0x258 kernel_init+0xa/0x103 ret_from_fork+0x1f/0x30 -> #5 (bio_slab_lock){+.+.}-{3:3}: __mutex_lock+0x9f/0x930 bioset_init+0xbc/0x2b0 __blk_alloc_queue+0x6f/0x2d0 blk_mq_init_queue_data+0x1b/0x70 loop_add+0x110/0x290 [loop] fq_codel_tcf_block+0x12/0x20 [sch_fq_codel] do_one_initcall+0x5a/0x2e0 do_init_module+0x5a/0x220 load_module+0x2459/0x26e0 __do_sys_finit_module+0xba/0xe0 do_syscall_64+0x50/0x90 entry_SYSCALL_64_after_hwframe+0x44/0xa9 -> #4 (loop_ctl_mutex){+.+.}-{3:3}: __mutex_lock+0x9f/0x930 lo_open+0x18/0x50 [loop] __blkdev_get+0xec/0x570 blkdev_get+0xe8/0x150 do_dentry_open+0x167/0x410 path_openat+0x7c9/0xa80 do_filp_open+0x93/0x100 do_sys_openat2+0x22a/0x2e0 do_sys_open+0x4b/0x80 do_syscall_64+0x50/0x90 entry_SYSCALL_64_after_hwframe+0x44/0xa9 -> #3 (&bdev->bd_mutex){+.+.}-{3:3}: __mutex_lock+0x9f/0x930 blkdev_put+0x1d/0x120 close_fs_devices.part.31+0x84/0x130 btrfs_close_devices+0x44/0xb0 close_ctree+0x296/0x2b2 generic_shutdown_super+0x69/0x100 kill_anon_super+0xe/0x30 btrfs_kill_super+0x12/0x20 deactivate_locked_super+0x29/0x60 cleanup_mnt+0xb8/0x140 task_work_run+0x6d/0xb0 __prepare_exit_to_usermode+0x1cc/0x1e0 do_syscall_64+0x5c/0x90 entry_SYSCALL_64_after_hwframe+0x44/0xa9 -> #2 (&fs_devs->device_list_mutex){+.+.}-{3:3}: __mutex_lock+0x9f/0x930 btrfs_run_dev_stats+0x49/0x480 commit_cowonly_roots+0xb5/0x2a0 btrfs_commit_transaction+0x516/0xa60 sync_filesystem+0x6b/0x90 generic_shutdown_super+0x22/0x100 kill_anon_super+0xe/0x30 btrfs_kill_super+0x12/0x20 deactivate_locked_super+0x29/0x60 cleanup_mnt+0xb8/0x140 task_work_run+0x6d/0xb0 __prepare_exit_to_usermode+0x1cc/0x1e0 do_syscall_64+0x5c/0x90 entry_SYSCALL_64_after_hwframe+0x44/0xa9 -> #1 (&fs_info->tree_log_mutex){+.+.}-{3:3}: __mutex_lock+0x9f/0x930 btrfs_commit_transaction+0x4bb/0xa60 sync_filesystem+0x6b/0x90 generic_shutdown_super+0x22/0x100 kill_anon_super+0xe/0x30 btrfs_kill_super+0x12/0x20 deactivate_locked_super+0x29/0x60 cleanup_mnt+0xb8/0x140 task_work_run+0x6d/0xb0 __prepare_exit_to_usermode+0x1cc/0x1e0 do_syscall_64+0x5c/0x90 entry_SYSCALL_64_after_hwframe+0x44/0xa9 -> #0 (&fs_info->reloc_mutex){+.+.}-{3:3}: __lock_acquire+0x1272/0x2310 lock_acquire+0x9e/0x360 __mutex_lock+0x9f/0x930 btrfs_record_root_in_trans+0x43/0x70 start_transaction+0xd1/0x5d0 btrfs_dirty_inode+0x42/0xd0 file_update_time+0xc8/0x110 btrfs_page_mkwrite+0x10c/0x4a0 do_page_mkwrite+0x4d/0xc0 handle_mm_fault+0x103c/0x1730 exc_page_fault+0x340/0x660 asm_exc_page_fault+0x1e/0x30 other info that might help us debug this: Chain exists of: &fs_info->reloc_mutex --> &mm->mmap_lock#2 --> sb_pagefaults Possible unsafe locking scenario: CPU0 CPU1 ---- ---- lock(sb_pagefaults); lock(&mm->mmap_lock#2); lock(sb_pagefaults); lock(&fs_info->reloc_mutex); *** DEADLOCK *** 3 locks held by fsstress/8739: #0: ffff88bee66eeb68 (&mm->mmap_lock#2){++++}-{3:3}, at: exc_page_fault+0x173/0x660 #1: ffff88bfbd16e538 (sb_pagefaults){.+.+}-{0:0}, at: btrfs_page_mkwrite+0x6a/0x4a0 #2: ffff88bfbd16e630 (sb_internal){.+.+}-{0:0}, at: start_transaction+0x3da/0x5d0 stack backtrace: CPU: 17 PID: 8739 Comm: fsstress Kdump: loaded Not tainted 5.8.0-rc7-00169-g87212851a027-dirty #929 Hardware name: Quanta Tioga Pass Single Side 01-0030993006/Tioga Pass Single Side, BIOS F08_3A18 12/20/2018 Call Trace: dump_stack+0x78/0xa0 check_noncircular+0x165/0x180 __lock_acquire+0x1272/0x2310 ? btrfs_get_alloc_profile+0x150/0x210 lock_acquire+0x9e/0x360 ? btrfs_record_root_in_trans+0x43/0x70 __mutex_lock+0x9f/0x930 ? btrfs_record_root_in_trans+0x43/0x70 ? lock_acquire+0x9e/0x360 ? join_transaction+0x5d/0x450 ? find_held_lock+0x2d/0x90 ? btrfs_record_root_in_trans+0x43/0x70 ? join_transaction+0x3d5/0x450 ? btrfs_record_root_in_trans+0x43/0x70 btrfs_record_root_in_trans+0x43/0x70 start_transaction+0xd1/0x5d0 btrfs_dirty_inode+0x42/0xd0 file_update_time+0xc8/0x110 btrfs_page_mkwrite+0x10c/0x4a0 ? handle_mm_fault+0x5e/0x1730 do_page_mkwrite+0x4d/0xc0 ? __do_fault+0x32/0x150 handle_mm_fault+0x103c/0x1730 exc_page_fault+0x340/0x660 ? asm_exc_page_fault+0x8/0x30 asm_exc_page_fault+0x1e/0x30 RIP: 0033:0x7faa6c9969c4 Was seen in testing. The fix is similar to that of btrfs: open device without device_list_mutex where we're holding the device_list_mutex and then grab the bd_mutex, which pulls in a bunch of dependencies under the bd_mutex. We only ever call btrfs_close_devices() on mount failure or unmount, so we're save to not have the device_list_mutex here. We're already holding the uuid_mutex which keeps us safe from any external modification of the fs_devices. Signed-off-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: David Sterba <dsterba@suse.com> |
|
|
|
62cf539120 |
btrfs: move btrfs_rm_dev_replace_free_srcdev outside of all locks
When closing and freeing the source device we could end up doing our final blkdev_put() on the bdev, which will grab the bd_mutex. As such we want to be holding as few locks as possible, so move this call outside of the dev_replace->lock_finishing_cancel_unmount lock. Since we're modifying the fs_devices we need to make sure we're holding the uuid_mutex here, so take that as well. Signed-off-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: David Sterba <dsterba@suse.com> |
|
|
|
68abf36016 |
btrfs: remove alloc_list splice in btrfs_prepare_sprout
btrfs_prepare_sprout is called when the first rw device is added to a seed filesystem. This means the filesystem can't have its alloc_list be non-empty, since seed filesystems are read only. Simply remove the code altogether. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
|
|
|
427c8fddb1 |
btrfs: document some invariants of seed code
Without good understanding of how seed devices works it's hard to grok some of what the code in open_seed_devices or btrfs_prepare_sprout does. Add comments hopefully reducing some of the cognitive load. Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
|
|
|
944d3f9fac |
btrfs: switch seed device to list api
While this patch touches a bunch of files the conversion is straighforward. Instead of using the implicit linked list anchored at btrfs_fs_devices::seed the code is switched to using list_for_each_entry. Previous patches in the series already factored out code that processed both main and seed devices so in those cases the factored out functions are called on the main fs_devices and then on every seed dev inside list_for_each_entry. Using list api also allows to simplify deletion from the seed dev list performed in btrfs_rm_device and btrfs_rm_dev_replace_free_srcdev by substituting a while() loop with a simple list_del_init. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
|
|
|
c4989c2fd0 |
btrfs: simplify setting/clearing fs_info to btrfs_fs_devices
It makes no sense to have sysfs-related routines be responsible for properly initialising the fs_info pointer of struct btrfs_fs_device. Instead this can be streamlined by making it the responsibility of btrfs_init_devices_late to initialize it. That function already initializes fs_info of every individual device in btrfs_fs_devices. As far as clearing it is concerned it makes sense to move it to close_fs_devices. That function is only called when struct btrfs_fs_devices is no longer in use - either for holding seeds or main devices for a mounted filesystem. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
|
|
|
54eed6ae8d |
btrfs: make close_fs_devices return void
The return value of this function conveys absolutely no information. All callers already check the state of fs_devices->opened to decide how to proceed. So convert the function to returning void. While at it make btrfs_close_devices also return void. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
|
|
|
3712ccb7f1 |
btrfs: factor out loop logic from btrfs_free_extra_devids
This prepares the code to switching seeds devices to a proper list. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
|
|
|
dc0ab488d2 |
btrfs: factor out reada loop in __reada_start_machine
This is in preparation for moving fs_devices to proper lists. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
|
|
|
1028d1c48b |
btrfs: remove err variable from btrfs_get_extent
There's no practical reason too use 'err' as a variable to convey
errors. In fact it's value is either set explicitly in the beginning of
the function or it simply takes the value of 'ret'. Not conforming to
the usual pattern of having ret be the only variable used to convey
errors makes the code more error prone to bugs. In fact one such bug
was introduced by
|
|
|
|
0eb79294db |
btrfs: dio iomap DSYNC workaround
iomap dio will run generic_write_sync() for us if the iocb is DSYNC. This is problematic for us because of 2 reasons: 1. we hold the inode_lock() during this operation, and we take it in generic_write_sync() 2. we hold a read lock on the dio_sem but take the write lock in fsync Since we don't want to rip out this code right now, but reworking the locking is a bit much to do at this point, work around this problem with this masterpiece of a patch. First, we clear DSYNC on the iocb so that the iomap stuff doesn't know that it needs to handle the sync. We save this fact in current->journal_info, because we need to see do special things once we're in iomap_begin, and we have no way to pass private information into iomap_dio_rw(). Next we specify a separate iomap_dio_ops for sync, which implements an ->end_io() callback that gets called when the dio completes. This is important for AIO, because we really do need to run generic_write_sync() if we complete asynchronously. However if we're still in the submitting context when we enter ->end_io() we clear the flag so that the submitter knows they're the ones that needs to run generic_write_sync(). This is meant to be temporary. We need to work out how to eliminate the inode_lock() and the dio_sem in our fsync and use another mechanism to protect these operations. Tested-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
|
|
|
f85781fb50 |
btrfs: switch to iomap for direct IO
We're using direct io implementation based on buffer heads. This patch switches to the new iomap infrastructure. Switch from __blockdev_direct_IO() to iomap_dio_rw(). Rename btrfs_get_blocks_direct() to btrfs_dio_iomap_begin() and use it as iomap_begin() for iomap direct I/O functions. This function allocates and locks all the blocks required for the I/O. btrfs_submit_direct() is used as the submit_io() hook for direct I/O ops. Since we need direct I/O reads to go through iomap_dio_rw(), we change file_operations.read_iter() to a btrfs_file_read_iter() which calls btrfs_direct_IO() for direct reads and falls back to generic_file_buffered_read() for incomplete reads and buffered reads. We don't need address_space.direct_IO() anymore: set it to noop. Similarly, we don't need flags used in __blockdev_direct_IO(). iomap is capable of direct I/O reads from a hole, so we don't need to return -ENOENT. Btrfs direct I/O is now done under i_rwsem, shared in case of reads and exclusive in case of writes. This guards against simultaneous truncates. Use iomap->iomap_end() to check for failed or incomplete direct I/O: - for writes, call __endio_write_update_ordered() - for reads, unlock extents btrfs_dio_data is now hooked in iomap->private and not current->journal_info. It carries the reservation variable and the amount of data submitted, so we can calculate the amount of data to call __endio_write_update_ordered in case of an error. This patch removes last use of struct buffer_head from btrfs. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
|
|
|
154f7cb868 |
btrfs: add owner and fs_info to alloc_state io_tree
Commit |
|
|
|
4c448ce8b4 |
btrfs: make read_block_group_item return void
Since it's inclusion on
|
|
|
|
24646481fb |
btrfs: sysfs: fix unused-but-set-variable warnings
The compilation with W=1 generates the following warnings:
fs/btrfs/sysfs.c:1630:6: warning: variable 'ret' set but not used [-Wunused-but-set-variable]
1630 | int ret;
| ^~~
fs/btrfs/sysfs.c:1629:6: warning: variable 'features' set but not used [-Wunused-but-set-variable]
1629 | u64 features;
| ^~~~~~~~
[ The unused variables are leftover from
|
|
|
|
487781796d |
btrfs: make fast fsyncs wait only for writeback
Currently regardless of a full or a fast fsync we always wait for ordered
extents to complete, and then start logging the inode after that. However
for fast fsyncs we can just wait for the writeback to complete, we don't
need to wait for the ordered extents to complete since we use the list of
modified extents maps to figure out which extents we must log and we can
get their checksums directly from the ordered extents that are still in
flight, otherwise look them up from the checksums tree.
Until commit
|
|
|
|
75b463d2b4 |
btrfs: do not commit logs and transactions during link and rename operations
Since commit
|
|
|
|
5522a27e59 |
btrfs: do not take the log_mutex of the subvolume when pinning the log
During a rename we pin the log to make sure no one commits a log that
reflects an ongoing rename operation, as it might result in a committed
log where it recorded the unlink of the old name without having recorded
the new name. However we are taking the subvolume's log_mutex before
incrementing the log_writers counter, which is not necessary since that
counter is atomic and we only remove the old name from the log and add
the new name to the log after we have incremented log_writers, ensuring
that no one can commit the log after we have removed the old name from
the log and before we added the new name to the log.
By taking the log_mutex lock we are just adding unnecessary contention on
the lock, which can become visible for workloads that mix renames with
fsyncs, writes for files opened with O_SYNC and unlink operations (if the
inode or its parent were fsynced before in the current transaction).
So just remove the lock and unlock of the subvolume's log_mutex at
btrfs_pin_log_trans().
Using dbench, which mixes different types of operations that end up taking
that mutex (fsyncs, renames, unlinks and writes into files opened with
O_SYNC) revealed some small gains. The following script that calls dbench
was used:
#!/bin/bash
DEV=/dev/nvme0n1
MNT=/mnt/btrfs
MOUNT_OPTIONS="-o ssd -o space_cache=v2"
MKFS_OPTIONS="-m single -d single"
THREADS=32
echo "performance" | tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor
mkfs.btrfs -f $MKFS_OPTIONS $DEV
mount $MOUNT_OPTIONS $DEV $MNT
dbench -s -t 600 -D $MNT $THREADS
umount $MNT
The test was run on bare metal, no virtualization, on a box with 12 cores
(Intel i7-8700), 64Gb of RAM and using a NVMe device, with a kernel
configuration that is the default of typical distributions (debian in this
case), without debug options enabled (kasan, kmemleak, slub debug, debug
of page allocations, lock debugging, etc).
Results before this patch:
Operation Count AvgLat MaxLat
----------------------------------------
NTCreateX 4410848 0.017 738.640
Close 3240222 0.001 0.834
Rename 186850 7.478 1272.476
Unlink 890875 0.128 785.018
Deltree 128 2.846 12.081
Mkdir 64 0.002 0.003
Qpathinfo 3997659 0.009 11.171
Qfileinfo 701307 0.001 0.478
Qfsinfo 733494 0.002 1.103
Sfileinfo 359362 0.004 3.266
Find
|
|
|
|
1b51d6fce4 |
btrfs: send: remove indirect callback parameter for changed_cb
There's a custom callback passed to btrfs_compare_trees which happens to be named exactly same as the existing function implementing it. This is confusing and the indirection is not necessary for our needs. Compiler is clever enough to call it directly so there's effectively no change. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: David Sterba <dsterba@suse.com> |
|
|
|
8bb1cf1ba6 |
btrfs: scrub: rename ratelimit state varaible to avoid shadowing
There's already defined _rs within ctree.h:btrfs_printk_ratelimited, local variables should not use _ to avoid such name clashes with macro-local variables. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: David Sterba <dsterba@suse.com> |
|
|
|
0af447d050 |
btrfs: remove unnecessarily shadowed variables
In btrfs_orphan_cleanup, there's another instance of fs_info, but it's the same as the one we already have. In btrfs_backref_finish_upper_links, rb_node is same type and used as temporary cursor to the tree. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: David Sterba <dsterba@suse.com> |
|
|
|
cb4c919830 |
btrfs: compression: move declarations to header
The declarations of compression algorithm callbacks are defined in the .c file as they're used from there. Compiler warns that there are no declarations for public functions when compiling lzo.c/zlib.c/zstd.c. Fix that by moving the declarations to the header as it's the common place for all of them. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: David Sterba <dsterba@suse.com> |
|
|
|
9e6df7cedf |
btrfs: remove const from btrfs_feature_set_name
The function btrfs_feature_set_name returns a const char pointer, the
second const is not necessary and reported as a warning:
In file included from fs/btrfs/space-info.c:6:
fs/btrfs/sysfs.h:16:1: warning: type qualifiers ignored on function return type [-Wignored-qualifiers]
16 | const char * const btrfs_feature_set_name(enum btrfs_feature_set set);
| ^~~~~
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
|
|
e21139c621 |
btrfs: cleanup calculation of lockend in lock_and_cleanup_extent_if_need()
We're just doing rounding up to sectorsize to calculate the lockend. There is no need to do the unnecessary length calculation, just direct round_up() is enough. Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
|
|
|
c4923027bd |
btrfs: fix possible infinite loop in data async reclaim
Dave reported an issue where generic/102 would sometimes hang. This turned out to be because we'd get into this spot where we were no longer making progress on data reservations because our exit condition was not met. The log is basically while (!space_info->full && !list_empty(&space_info->tickets)) flush_space(space_info, flush_state); where flush state is our various flush states, but doesn't include ALLOC_CHUNK_FORCE. This is because we actually lead with allocating chunks, and so the assumption was that once you got to the actual flushing states you could no longer allocate chunks. This was a stupid assumption, because you could have deleted block groups that would be reclaimed by a transaction commit, thus unsetting space_info->full. This is essentially what happens with generic/102, and so sometimes you'd get stuck in the flushing loop because we weren't allocating chunks, but flushing space wasn't giving us what we needed to make progress. Fix this by adding ALLOC_CHUNK_FORCE to the end of our flushing states, that way we will eventually bail out because we did end up with space_info->full if we free'd a chunk previously. Otherwise, as is the case for this test, we'll allocate our chunk and continue on our happy merry way. Reported-by: David Sterba <dsterba@suse.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: David Sterba <dsterba@suse.com> |
|
|
|
1a7a92c8dd |
btrfs: add a comment explaining the data flush steps
The data flushing steps are not obvious to people other than myself and Chris. Write a giant comment explaining the reasoning behind each flush step for data as well as why it is in that particular order. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: David Sterba <dsterba@suse.com> |
|
|
|
5705674081 |
btrfs: do async reclaim for data reservations
Now that we have the data ticketing stuff in place, move normal data reservations to use an async reclaim helper to satisfy tickets. Before we could have multiple tasks race in and both allocate chunks, resulting in more data chunks than we would necessarily need. Serializing these allocations and making a single thread responsible for flushing will only allocate chunks as needed, as well as cut down on transaction commits and other flush related activities. Priority reservations will still work as they have before, simply trying to allocate a chunk until they can make their reservation. Reviewed-by: Nikolay Borisov <nborisov@suse.com> Tested-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: David Sterba <dsterba@suse.com> |
|
|
|
cb3e393045 |
btrfs: flush delayed refs when trying to reserve data space
We can end up with freed extents in the delayed refs, and thus may_commit_transaction() may not think we have enough pinned space to commit the transaction and we'll ENOSPC early. Handle this by running the delayed refs in order to make sure pinned is uptodate before we try to commit the transaction. Tested-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: David Sterba <dsterba@suse.com> |
|
|
|
327feeeb2e |
btrfs: run delayed iputs before committing the transaction for data
Before we were waiting on iputs after we committed the transaction, but this doesn't really make much sense. We want to reclaim any space we may have in order to be more likely to commit the transaction, due to pinned space being added by running the delayed iputs. Fix this by making delayed iputs run before committing the transaction. Reviewed-by: Nikolay Borisov <nborisov@suse.com> Tested-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: David Sterba <dsterba@suse.com> |
|
|
|
bb86bd3db8 |
btrfs: don't force commit if we are data
We used to unconditionally commit the transaction at least 2 times and then on the 3rd try check against pinned space to make sure committing the transaction was worth the effort. This is overkill, we know nobody is going to steal our reservation, and if we can't make our reservation with the pinned amount simply bail out. This also cleans up the passing of bytes_needed to may_commit_transaction, as that was the thing we added into place in order to accomplish this behavior. We no longer need it so remove that mess. Reviewed-by: Nikolay Borisov <nborisov@suse.com> Tested-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: David Sterba <dsterba@suse.com> |
|
|
|
0282700135 |
btrfs: drop the commit_cycles stuff for data reservations
This was an old wart left over from how we previously did data reservations. Before we could have people race in and take a reservation while we were flushing space, so we needed to make sure we looped a few times before giving up. Now that we're using the ticketing infrastructure we don't have to worry about this and can drop the logic altogether. Reviewed-by: Nikolay Borisov <nborisov@suse.com> Tested-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: David Sterba <dsterba@suse.com> |
|
|
|
f3bda421c1 |
btrfs: use the same helper for data and metadata reservations
Now that data reservations follow the same pattern as metadata reservations we can simply rename __reserve_metadata_bytes to __reserve_bytes and use that helper for data reservations. Things to keep in mind, btrfs_can_overcommit() returns 0 for data, because we can never overcommit. We also will never pass in FLUSH_ALL for data, so we'll simply be added to the priority list and go straight into handle_reserve_ticket. Reviewed-by: Nikolay Borisov <nborisov@suse.com> Tested-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: David Sterba <dsterba@suse.com> |
|
|
|
0532a6f8b6 |
btrfs: serialize data reservations if we are flushing
Nikolay reported a problem where generic/371 would fail sometimes with a slow drive. The gist of the test is that we fallocate a file in parallel with a pwrite of a different file. These two files combined are smaller than the file system, but sometimes the pwrite would ENOSPC. A fair bit of investigation uncovered the fact that the fallocate workload was racing in and grabbing the free space that the pwrite workload was trying to free up so it could make its own reservation. After a few loops of this eventually the pwrite workload would error out with an ENOSPC. We've had the same problem with metadata as well, and we serialized all metadata allocations to satisfy this problem. This wasn't usually a problem with data because data reservations are more straightforward, but obviously could still happen. Fix this by not allowing reservations to occur if there are any pending tickets waiting to be satisfied on the space info. Reviewed-by: Nikolay Borisov <nborisov@suse.com> Tested-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: David Sterba <dsterba@suse.com> |
|
|
|
1004f6860f |
btrfs: use ticketing for data space reservations
Now that we have all the infrastructure in place, use the ticketing infrastructure to make data allocations. This still maintains the exact same flushing behavior, but now we're using tickets to get our reservations satisfied. Reviewed-by: Nikolay Borisov <nborisov@suse.com> Tested-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: David Sterba <dsterba@suse.com> |
|
|
|
8698fc4eb7 |
btrfs: add btrfs_reserve_data_bytes and use it
Create a new function btrfs_reserve_data_bytes() in order to handle data reservations. This uses the new flush types and flush states to handle making data reservations. This patch specifically does not change any functionality, and is purposefully not cleaned up in order to make bisection easier for the future patches. The new helper is identical to the old helper in how it handles data reservations. We first try to force a chunk allocation, and then we run through the flush states all at once and in the same order that they were done with the old helper. Subsequent patches will clean this up and change the behavior of the flushing, and it is important to keep those changes separate so we can easily bisect down to the patch that caused the regression, rather than the patch that made us start using the new infrastructure. Reviewed-by: Nikolay Borisov <nborisov@suse.com> Tested-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: David Sterba <dsterba@suse.com> |
|
|
|
a1ed0a8216 |
btrfs: add the data transaction commit logic into may_commit_transaction
Data space flushing currently unconditionally commits the transaction twice in a row, and the last time it checks if there's enough pinned extents to satisfy its reservation before deciding to commit the transaction for the 3rd and final time. Encode this logic into may_commit_transaction(). In the next patch we will pass in U64_MAX for bytes_needed the first two times, and the final time we will pass in the actual bytes we need so the normal logic will apply. This patch exists solely to make the logical changes I will make to the flushing state machine separate to make it easier to bisect any performance related regressions. Reviewed-by: Nikolay Borisov <nborisov@suse.com> Tested-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: David Sterba <dsterba@suse.com> |
|
|
|
058e6d1d26 |
btrfs: add flushing states for handling data reservations
Currently the way we do data reservations is by seeing if we have enough space in our space_info. If we do not and we're a normal inode we'll 1) Attempt to force a chunk allocation until we can't anymore. 2) If that fails we'll flush delalloc, then commit the transaction, then run the delayed iputs. If we are a free space inode we're only allowed to force a chunk allocation. In order to use the normal flushing mechanism we need to encode this into a flush state array for normal inodes. Since both will start with allocating chunks until the space info is full there is no need to add this as a flush state, this will be handled specially. Reviewed-by: Nikolay Borisov <nborisov@suse.com> Tested-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: David Sterba <dsterba@suse.com> |
|
|
|
448b966b49 |
btrfs: check tickets after waiting on ordered extents
Right now if the space is freed up after the ordered extents complete (which is likely since the reservations are held until they complete), we would do extra delalloc flushing before we'd notice that we didn't have any more tickets. Fix this by moving the tickets check after our wait_ordered_extents check. Tested-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
|
|
|
38d715f494 |
btrfs: use btrfs_start_delalloc_roots in shrink_delalloc
The original iteration of flushing had us flushing delalloc and then checking to see if we could make our reservation, thus we were very careful about how many pages we would flush at once. But now that everything is async and we satisfy tickets as the space becomes available we don't have to keep track of any of this, simply try and flush the number of dirty inodes we may have in order to reclaim space to make our reservation. This cleans up our delalloc flushing significantly. The async_pages stuff is dropped because btrfs_start_delalloc_roots() handles the case that we generate async extents for us, so we no longer require this extra logic. Reviewed-by: Nikolay Borisov <nborisov@suse.com> Tested-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
|
|
|
39753e4a3a |
btrfs: use the btrfs_space_info_free_bytes_may_use helper for delalloc
We are going to use the ticket infrastructure for data, so use the btrfs_space_info_free_bytes_may_use() helper in btrfs_free_reserved_data_space_noquota() so we get the btrfs_try_granting_tickets call when we free our reservation. Reviewed-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Tested-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
|
|
|
99ffb43e5d |
btrfs: call btrfs_try_granting_tickets when reserving space
If we have compression on we could free up more space than we reserved, and thus be able to make a space reservation. Add the call for this scenario. Reviewed-by: Nikolay Borisov <nborisov@suse.com> Tested-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
|
|
|
2732798c9b |
btrfs: call btrfs_try_granting_tickets when unpinning anything
When unpinning we were only calling btrfs_try_granting_tickets() if global_rsv->space_info == space_info, which is problematic because we use ticketing for SYSTEM chunks, and want to use it for DATA as well. Fix this by moving this call outside of that if statement. Reviewed-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Tested-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
|
|
|
3308234a7e |
btrfs: call btrfs_try_granting_tickets when freeing reserved bytes
We were missing a call to btrfs_try_granting_tickets in btrfs_free_reserved_bytes, so add it to handle the case where we're able to satisfy an allocation because we've freed a pending reservation. Reviewed-by: Nikolay Borisov <nborisov@suse.com> Tested-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
|
|
|
c6c453032e |
btrfs: make ALLOC_CHUNK use the space info flags
We have traditionally used flush_space() to flush metadata space, so we've been unconditionally using btrfs_metadata_alloc_profile() for our profile to allocate a chunk. However if we're going to use this for data we need to use btrfs_get_alloc_profile() on the space_info we pass in. Reviewed-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Tested-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
|
|
|
920a9958c2 |
btrfs: make shrink_delalloc take space_info as an arg
Currently shrink_delalloc just looks up the metadata space info, but this won't work if we're trying to reclaim space for data chunks. We get the right space_info we want passed into flush_space, so simply pass that along to shrink_delalloc. Reviewed-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Tested-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
|
|
|
d7f81fac97 |
btrfs: handle U64_MAX for shrink_delalloc
Data allocations are going to want to pass in U64_MAX for flushing space, adjust shrink_delalloc to handle this properly. Reviewed-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Tested-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
|
|
|
288be2d997 |
btrfs: remove orig from shrink_delalloc
We don't use this anywhere inside of shrink_delalloc since
|
|
|
|
b49121393f |
btrfs: change nr to u64 in btrfs_start_delalloc_roots
We have btrfs_wait_ordered_roots() which takes a u64 for nr, but btrfs_start_delalloc_roots() that takes an int for nr, which makes using them in conjunction, especially for something like (u64)-1, annoying and inconsistent. Fix btrfs_start_delalloc_roots() to take a u64 for nr and adjust start_delalloc_inodes() and it's callers appropriately. This means we've adjusted start_delalloc_inodes() to take a pointer of nr since we want to preserve the ability for start-delalloc_inodes() to return an error, so simply make it do the nr adjusting as necessary. Part of adjusting the callers to this means changing btrfs_writeback_inodes_sb_nr() to take a u64 for items. This may be confusing because it seems unrelated, but the caller of btrfs_writeback_inodes_sb_nr() already passes in a u64, it's just the function variable that needs to be changed. Reviewed-by: Nikolay Borisov <nborisov@suse.com> Tested-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
|
|
|
8e56008180 |
btrfs: remove fsid argument from btrfs_sysfs_update_sprout_fsid
It can be accessed from 'fs_devices' as it's identical to fs_info->fs_devices. Also add a comment about why we are calling the function. No semantic changes. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
|
|
|
57297c1e8e |
btrfs: remove spurious BUG_ON in btrfs_get_extent
That BUG_ON cannot ever trigger because as the comment there states - 'err' is always set. Simply remove it as it brings no value. Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
|
|
|
260db43cd2 |
btrfs: delete duplicated words + other fixes in comments
Delete repeated words in fs/btrfs/.
{to, the, a, and old}
and change "into 2 part" to "into 2 parts".
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Randy Dunlap <rdunlap@infradead.org>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
|
|
437490fed3 |
btrfs: tracepoints: output proper root owner for trace_find_free_extent()
The current trace event always output result like this: find_free_extent: root=2(EXTENT_TREE) len=16384 empty_size=0 flags=4(METADATA) find_free_extent: root=2(EXTENT_TREE) len=16384 empty_size=0 flags=4(METADATA) find_free_extent: root=2(EXTENT_TREE) len=8192 empty_size=0 flags=1(DATA) find_free_extent: root=2(EXTENT_TREE) len=8192 empty_size=0 flags=1(DATA) find_free_extent: root=2(EXTENT_TREE) len=4096 empty_size=0 flags=1(DATA) find_free_extent: root=2(EXTENT_TREE) len=4096 empty_size=0 flags=1(DATA) T's saying we're allocating data extent for EXTENT tree, which is not even possible. It's because we always use EXTENT tree as the owner for trace_find_free_extent() without using the @root from btrfs_reserve_extent(). This patch will change the parameter to use proper @root for trace_find_free_extent(): Now it looks much better: find_free_extent: root=5(FS_TREE) len=16384 empty_size=0 flags=36(METADATA|DUP) find_free_extent: root=5(FS_TREE) len=8192 empty_size=0 flags=1(DATA) find_free_extent: root=5(FS_TREE) len=16384 empty_size=0 flags=1(DATA) find_free_extent: root=5(FS_TREE) len=4096 empty_size=0 flags=1(DATA) find_free_extent: root=5(FS_TREE) len=8192 empty_size=0 flags=1(DATA) find_free_extent: root=5(FS_TREE) len=16384 empty_size=0 flags=36(METADATA|DUP) find_free_extent: root=7(CSUM_TREE) len=16384 empty_size=0 flags=36(METADATA|DUP) find_free_extent: root=2(EXTENT_TREE) len=16384 empty_size=0 flags=36(METADATA|DUP) find_free_extent: root=1(ROOT_TREE) len=16384 empty_size=0 flags=36(METADATA|DUP) Reported-by: Hans van Kranenburg <hans@knorrie.org> CC: stable@vger.kernel.org # 5.4+ Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
|
|
|
4e3b9ce271 |
for-5.9-rc7-tag
-----BEGIN PGP SIGNATURE----- iQIzBAABCgAdFiEE8rQSAMVO+zA4DBdWxWXV+ddtWDsFAl93REAACgkQxWXV+ddt WDv0/A//XYr1XLC/5sMILHqYZ4ogiFxC3Nfjeyt6vfBPX3J0d2eHnw5Rw+ZHHHdQ qtoKWom9ZwCxjybghwmvfxJuohy+6Sc764aEj+rYpUcCmmUZsAZZpmwpZqpYG+0H DEn9p45T0MO+r5lsF/GdNqqsdXZfUlZy7PweIhZucQxENM8cowklqKCo4AU2IEW4 203THU3UxQayn0um6kaiesioh8TtT+R9UVAyyA3n6lGINHKG8AMy0ulS/M2Uzgq5 eAzWne4Opy+wLxubBdeqruPiQrFQp+JV/YhTTEHGKRXykRYXwZnCDYdK27X4UKkt g3Ne0cEd/JuxZfb3Mzsd7+MF0xr9xKJPziFXv7YZt0LkiHE+B0b/DwA9FksR9sdO 4BY2oe0gztstIMqQ5qnriJMDQxonyUt2G65YW8sCI9b32vRYaHLhCWZRYzbmftEO W4FJOnAI2It3Ib0CUkBjkPYkmH113Q6g59k015IpoYRGmExhnC59zhuijdmthxFJ S5PXFymVhxt9iMOKM0jE17Rp/j4hVg/bdFVHJryzlOsldjq63Vukqoo24SQhiqfY qYn/Ilkc/h1YD/pxehFAhZcbGfEdjD5oo8OkGoKIUXfv35r7JH/5F/x+4DxZNnYk n0oHJ7WBR01AlHAcuTvsN7z9O2ZX6wZufkkgKYLBvtGtyC71T3A= =MT2i -----END PGP SIGNATURE----- Merge tag 'for-5.9-rc7-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux Pull btrfs fixes from David Sterba: "Two more fixes. One is for a lockdep warning/lockup (also caught by syzbot), that one has been seen in practice. Regarding the other syzbot reports mentioned last time, they don't seem to be urgent and reliably reproducible so they'll be fixed later. The second fix is for a potential corruption when device replace finishes and the in-memory state of trim is not copied to the new device" * tag 'for-5.9-rc7-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux: btrfs: fix filesystem corruption after a device replace btrfs: move btrfs_rm_dev_replace_free_srcdev outside of all locks btrfs: move btrfs_scratch_superblocks into btrfs_dev_replace_finishing |
|
|
|
4c8f353272 |
btrfs: fix filesystem corruption after a device replace
We use a device's allocation state tree to track ranges in a device used
for allocated chunks, and we set ranges in this tree when allocating a new
chunk. However after a device replace operation, we were not setting the
allocated ranges in the new device's allocation state tree, so that tree
is empty after a device replace.
This means that a fitrim operation after a device replace will trim the
device ranges that have allocated chunks and extents, as we trim every
range for which there is not a range marked in the device's allocation
state tree. It is also important during chunk allocation, since the
device's allocation state is used to determine if a range is already
allocated when allocating a new chunk.
This is trivial to reproduce and the following script triggers the bug:
$ cat reproducer.sh
#!/bin/bash
DEV1="/dev/sdg"
DEV2="/dev/sdh"
DEV3="/dev/sdi"
wipefs -a $DEV1 $DEV2 $DEV3 &> /dev/null
# Create a raid1 test fs on 2 devices.
mkfs.btrfs -f -m raid1 -d raid1 $DEV1 $DEV2 > /dev/null
mount $DEV1 /mnt/btrfs
xfs_io -f -c "pwrite -S 0xab 0 10M" /mnt/btrfs/foo
echo "Starting to replace $DEV1 with $DEV3"
btrfs replace start -B $DEV1 $DEV3 /mnt/btrfs
echo
echo "Running fstrim"
fstrim /mnt/btrfs
echo
echo "Unmounting filesystem"
umount /mnt/btrfs
echo "Mounting filesystem in degraded mode using $DEV3 only"
wipefs -a $DEV1 $DEV2 &> /dev/null
mount -o degraded $DEV3 /mnt/btrfs
if [ $? -ne 0 ]; then
dmesg | tail
echo
echo "Failed to mount in degraded mode"
exit 1
fi
echo
echo "File foo data (expected all bytes = 0xab):"
od -A d -t x1 /mnt/btrfs/foo
umount /mnt/btrfs
When running the reproducer:
$ ./replace-test.sh
wrote 10485760/10485760 bytes at offset 0
10 MiB, 2560 ops; 0.0901 sec (110.877 MiB/sec and 28384.5216 ops/sec)
Starting to replace /dev/sdg with /dev/sdi
Running fstrim
Unmounting filesystem
Mounting filesystem in degraded mode using /dev/sdi only
mount: /mnt/btrfs: wrong fs type, bad option, bad superblock on /dev/sdi, missing codepage or helper program, or other error.
[19581.748641] BTRFS info (device sdg): dev_replace from /dev/sdg (devid 1) to /dev/sdi started
[19581.803842] BTRFS info (device sdg): dev_replace from /dev/sdg (devid 1) to /dev/sdi finished
[19582.208293] BTRFS info (device sdi): allowing degraded mounts
[19582.208298] BTRFS info (device sdi): disk space caching is enabled
[19582.208301] BTRFS info (device sdi): has skinny extents
[19582.212853] BTRFS warning (device sdi): devid 2 uuid 1f731f47-e1bb-4f00-bfbb-9e5a0cb4ba9f is missing
[19582.213904] btree_readpage_end_io_hook: 25839 callbacks suppressed
[19582.213907] BTRFS error (device sdi): bad tree block start, want 30490624 have 0
[19582.214780] BTRFS warning (device sdi): failed to read root (objectid=7): -5
[19582.231576] BTRFS error (device sdi): open_ctree failed
Failed to mount in degraded mode
So fix by setting all allocated ranges in the replace target device when
the replace operation is finishing, when we are holding the chunk mutex
and we can not race with new chunk allocations.
A test case for fstests follows soon.
Fixes:
|
|
|
|
a466c85edc |
btrfs: move btrfs_rm_dev_replace_free_srcdev outside of all locks
When closing and freeing the source device we could end up doing our final blkdev_put() on the bdev, which will grab the bd_mutex. As such we want to be holding as few locks as possible, so move this call outside of the dev_replace->lock_finishing_cancel_unmount lock. Since we're modifying the fs_devices we need to make sure we're holding the uuid_mutex here, so take that as well. There's a report from syzbot probably hitting one of the cases where the bd_mutex and device_list_mutex are taken in the wrong order, however it's not with device replace, like this patch fixes. As there's no reproducer available so far, we can't verify the fix. https://lore.kernel.org/lkml/000000000000fc04d105afcf86d7@google.com/ dashboard link: https://syzkaller.appspot.com/bug?extid=84a0634dc5d21d488419 WARNING: possible circular locking dependency detected 5.9.0-rc5-syzkaller #0 Not tainted ------------------------------------------------------ syz-executor.0/6878 is trying to acquire lock: ffff88804c17d780 (&bdev->bd_mutex){+.+.}-{3:3}, at: blkdev_put+0x30/0x520 fs/block_dev.c:1804 but task is already holding lock: ffff8880908cfce0 (&fs_devs->device_list_mutex){+.+.}-{3:3}, at: close_fs_devices.part.0+0x2e/0x800 fs/btrfs/volumes.c:1159 which lock already depends on the new lock. the existing dependency chain (in reverse order) is: -> #4 (&fs_devs->device_list_mutex){+.+.}-{3:3}: __mutex_lock_common kernel/locking/mutex.c:956 [inline] __mutex_lock+0x134/0x10e0 kernel/locking/mutex.c:1103 btrfs_finish_chunk_alloc+0x281/0xf90 fs/btrfs/volumes.c:5255 btrfs_create_pending_block_groups+0x2f3/0x700 fs/btrfs/block-group.c:2109 __btrfs_end_transaction+0xf5/0x690 fs/btrfs/transaction.c:916 find_free_extent_update_loop fs/btrfs/extent-tree.c:3807 [inline] find_free_extent+0x23b7/0x2e60 fs/btrfs/extent-tree.c:4127 btrfs_reserve_extent+0x166/0x460 fs/btrfs/extent-tree.c:4206 cow_file_range+0x3de/0x9b0 fs/btrfs/inode.c:1063 btrfs_run_delalloc_range+0x2cf/0x1410 fs/btrfs/inode.c:1838 writepage_delalloc+0x150/0x460 fs/btrfs/extent_io.c:3439 __extent_writepage+0x441/0xd00 fs/btrfs/extent_io.c:3653 extent_write_cache_pages.constprop.0+0x69d/0x1040 fs/btrfs/extent_io.c:4249 extent_writepages+0xcd/0x2b0 fs/btrfs/extent_io.c:4370 do_writepages+0xec/0x290 mm/page-writeback.c:2352 __writeback_single_inode+0x125/0x1400 fs/fs-writeback.c:1461 writeback_sb_inodes+0x53d/0xf40 fs/fs-writeback.c:1721 wb_writeback+0x2ad/0xd40 fs/fs-writeback.c:1894 wb_do_writeback fs/fs-writeback.c:2039 [inline] wb_workfn+0x2dc/0x13e0 fs/fs-writeback.c:2080 process_one_work+0x94c/0x1670 kernel/workqueue.c:2269 worker_thread+0x64c/0x1120 kernel/workqueue.c:2415 kthread+0x3b5/0x4a0 kernel/kthread.c:292 ret_from_fork+0x1f/0x30 arch/x86/entry/entry_64.S:294 -> #3 (sb_internal#2){.+.+}-{0:0}: percpu_down_read include/linux/percpu-rwsem.h:51 [inline] __sb_start_write+0x234/0x470 fs/super.c:1672 sb_start_intwrite include/linux/fs.h:1690 [inline] start_transaction+0xbe7/0x1170 fs/btrfs/transaction.c:624 find_free_extent_update_loop fs/btrfs/extent-tree.c:3789 [inline] find_free_extent+0x25e1/0x2e60 fs/btrfs/extent-tree.c:4127 btrfs_reserve_extent+0x166/0x460 fs/btrfs/extent-tree.c:4206 cow_file_range+0x3de/0x9b0 fs/btrfs/inode.c:1063 btrfs_run_delalloc_range+0x2cf/0x1410 fs/btrfs/inode.c:1838 writepage_delalloc+0x150/0x460 fs/btrfs/extent_io.c:3439 __extent_writepage+0x441/0xd00 fs/btrfs/extent_io.c:3653 extent_write_cache_pages.constprop.0+0x69d/0x1040 fs/btrfs/extent_io.c:4249 extent_writepages+0xcd/0x2b0 fs/btrfs/extent_io.c:4370 do_writepages+0xec/0x290 mm/page-writeback.c:2352 __writeback_single_inode+0x125/0x1400 fs/fs-writeback.c:1461 writeback_sb_inodes+0x53d/0xf40 fs/fs-writeback.c:1721 wb_writeback+0x2ad/0xd40 fs/fs-writeback.c:1894 wb_do_writeback fs/fs-writeback.c:2039 [inline] wb_workfn+0x2dc/0x13e0 fs/fs-writeback.c:2080 process_one_work+0x94c/0x1670 kernel/workqueue.c:2269 worker_thread+0x64c/0x1120 kernel/workqueue.c:2415 kthread+0x3b5/0x4a0 kernel/kthread.c:292 ret_from_fork+0x1f/0x30 arch/x86/entry/entry_64.S:294 -> #2 ((work_completion)(&(&wb->dwork)->work)){+.+.}-{0:0}: __flush_work+0x60e/0xac0 kernel/workqueue.c:3041 wb_shutdown+0x180/0x220 mm/backing-dev.c:355 bdi_unregister+0x174/0x590 mm/backing-dev.c:872 del_gendisk+0x820/0xa10 block/genhd.c:933 loop_remove drivers/block/loop.c:2192 [inline] loop_control_ioctl drivers/block/loop.c:2291 [inline] loop_control_ioctl+0x3b1/0x480 drivers/block/loop.c:2257 vfs_ioctl fs/ioctl.c:48 [inline] __do_sys_ioctl fs/ioctl.c:753 [inline] __se_sys_ioctl fs/ioctl.c:739 [inline] __x64_sys_ioctl+0x193/0x200 fs/ioctl.c:739 do_syscall_64+0x2d/0x70 arch/x86/entry/common.c:46 entry_SYSCALL_64_after_hwframe+0x44/0xa9 -> #1 (loop_ctl_mutex){+.+.}-{3:3}: __mutex_lock_common kernel/locking/mutex.c:956 [inline] __mutex_lock+0x134/0x10e0 kernel/locking/mutex.c:1103 lo_open+0x19/0xd0 drivers/block/loop.c:1893 __blkdev_get+0x759/0x1aa0 fs/block_dev.c:1507 blkdev_get fs/block_dev.c:1639 [inline] blkdev_open+0x227/0x300 fs/block_dev.c:1753 do_dentry_open+0x4b9/0x11b0 fs/open.c:817 do_open fs/namei.c:3251 [inline] path_openat+0x1b9a/0x2730 fs/namei.c:3368 do_filp_open+0x17e/0x3c0 fs/namei.c:3395 do_sys_openat2+0x16d/0x420 fs/open.c:1168 do_sys_open fs/open.c:1184 [inline] __do_sys_open fs/open.c:1192 [inline] __se_sys_open fs/open.c:1188 [inline] __x64_sys_open+0x119/0x1c0 fs/open.c:1188 do_syscall_64+0x2d/0x70 arch/x86/entry/common.c:46 entry_SYSCALL_64_after_hwframe+0x44/0xa9 -> #0 (&bdev->bd_mutex){+.+.}-{3:3}: check_prev_add kernel/locking/lockdep.c:2496 [inline] check_prevs_add kernel/locking/lockdep.c:2601 [inline] validate_chain kernel/locking/lockdep.c:3218 [inline] __lock_acquire+0x2a96/0x5780 kernel/locking/lockdep.c:4426 lock_acquire+0x1f3/0xae0 kernel/locking/lockdep.c:5006 __mutex_lock_common kernel/locking/mutex.c:956 [inline] __mutex_lock+0x134/0x10e0 kernel/locking/mutex.c:1103 blkdev_put+0x30/0x520 fs/block_dev.c:1804 btrfs_close_bdev fs/btrfs/volumes.c:1117 [inline] btrfs_close_bdev fs/btrfs/volumes.c:1107 [inline] btrfs_close_one_device fs/btrfs/volumes.c:1133 [inline] close_fs_devices.part.0+0x1a4/0x800 fs/btrfs/volumes.c:1161 close_fs_devices fs/btrfs/volumes.c:1193 [inline] btrfs_close_devices+0x95/0x1f0 fs/btrfs/volumes.c:1179 close_ctree+0x688/0x6cb fs/btrfs/disk-io.c:4149 generic_shutdown_super+0x144/0x370 fs/super.c:464 kill_anon_super+0x36/0x60 fs/super.c:1108 btrfs_kill_super+0x38/0x50 fs/btrfs/super.c:2265 deactivate_locked_super+0x94/0x160 fs/super.c:335 deactivate_super+0xad/0xd0 fs/super.c:366 cleanup_mnt+0x3a3/0x530 fs/namespace.c:1118 task_work_run+0xdd/0x190 kernel/task_work.c:141 tracehook_notify_resume include/linux/tracehook.h:188 [inline] exit_to_user_mode_loop kernel/entry/common.c:163 [inline] exit_to_user_mode_prepare+0x1e1/0x200 kernel/entry/common.c:190 syscall_exit_to_user_mode+0x7e/0x2e0 kernel/entry/common.c:265 entry_SYSCALL_64_after_hwframe+0x44/0xa9 other info that might help us debug this: Chain exists of: &bdev->bd_mutex --> sb_internal#2 --> &fs_devs->device_list_mutex Possible unsafe locking scenario: CPU0 CPU1 ---- ---- lock(&fs_devs->device_list_mutex); lock(sb_internal#2); lock(&fs_devs->device_list_mutex); lock(&bdev->bd_mutex); *** DEADLOCK *** 3 locks held by syz-executor.0/6878: #0: ffff88809070c0e0 (&type->s_umount_key#70){++++}-{3:3}, at: deactivate_super+0xa5/0xd0 fs/super.c:365 #1: ffffffff8a5b37a8 (uuid_mutex){+.+.}-{3:3}, at: btrfs_close_devices+0x23/0x1f0 fs/btrfs/volumes.c:1178 #2: ffff8880908cfce0 (&fs_devs->device_list_mutex){+.+.}-{3:3}, at: close_fs_devices.part.0+0x2e/0x800 fs/btrfs/volumes.c:1159 stack backtrace: CPU: 0 PID: 6878 Comm: syz-executor.0 Not tainted 5.9.0-rc5-syzkaller #0 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011 Call Trace: __dump_stack lib/dump_stack.c:77 [inline] dump_stack+0x198/0x1fd lib/dump_stack.c:118 check_noncircular+0x324/0x3e0 kernel/locking/lockdep.c:1827 check_prev_add kernel/locking/lockdep.c:2496 [inline] check_prevs_add kernel/locking/lockdep.c:2601 [inline] validate_chain kernel/locking/lockdep.c:3218 [inline] __lock_acquire+0x2a96/0x5780 kernel/locking/lockdep.c:4426 lock_acquire+0x1f3/0xae0 kernel/locking/lockdep.c:5006 __mutex_lock_common kernel/locking/mutex.c:956 [inline] __mutex_lock+0x134/0x10e0 kernel/locking/mutex.c:1103 blkdev_put+0x30/0x520 fs/block_dev.c:1804 btrfs_close_bdev fs/btrfs/volumes.c:1117 [inline] btrfs_close_bdev fs/btrfs/volumes.c:1107 [inline] btrfs_close_one_device fs/btrfs/volumes.c:1133 [inline] close_fs_devices.part.0+0x1a4/0x800 fs/btrfs/volumes.c:1161 close_fs_devices fs/btrfs/volumes.c:1193 [inline] btrfs_close_devices+0x95/0x1f0 fs/btrfs/volumes.c:1179 close_ctree+0x688/0x6cb fs/btrfs/disk-io.c:4149 generic_shutdown_super+0x144/0x370 fs/super.c:464 kill_anon_super+0x36/0x60 fs/super.c:1108 btrfs_kill_super+0x38/0x50 fs/btrfs/super.c:2265 deactivate_locked_super+0x94/0x160 fs/super.c:335 deactivate_super+0xad/0xd0 fs/super.c:366 cleanup_mnt+0x3a3/0x530 fs/namespace.c:1118 task_work_run+0xdd/0x190 kernel/task_work.c:141 tracehook_notify_resume include/linux/tracehook.h:188 [inline] exit_to_user_mode_loop kernel/entry/common.c:163 [inline] exit_to_user_mode_prepare+0x1e1/0x200 kernel/entry/common.c:190 syscall_exit_to_user_mode+0x7e/0x2e0 kernel/entry/common.c:265 entry_SYSCALL_64_after_hwframe+0x44/0xa9 RIP: 0033:0x460027 RSP: 002b:00007fff59216328 EFLAGS: 00000246 ORIG_RAX: 00000000000000a6 RAX: 0000000000000000 RBX: 0000000000076035 RCX: 0000000000460027 RDX: 0000000000403188 RSI: 0000000000000002 RDI: 00007fff592163d0 RBP: 0000000000000333 R08: 0000000000000000 R09: 000000000000000b R10: 0000000000000005 R11: 0000000000000246 R12: 00007fff59217460 R13: 0000000002df2a60 R14: 0000000000000000 R15: 00007fff59217460 Signed-off-by: Josef Bacik <josef@toxicpanda.com> [ add syzbot reference ] Signed-off-by: David Sterba <dsterba@suse.com> |
|
|
|
313b085851 |
btrfs: move btrfs_scratch_superblocks into btrfs_dev_replace_finishing
We need to move the closing of the src_device out of all the device replace locking, but we definitely want to zero out the superblock before we commit the last time to make sure the device is properly removed. Handle this by pushing btrfs_scratch_superblocks into btrfs_dev_replace_finishing, and then later on we'll move the src_device closing and freeing stuff where we need it to be. Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: David Sterba <dsterba@suse.com> |
|
|
|
ed7b6b4f6e |
bdi: remove BDI_CAP_CGROUP_WRITEBACK
Just checking SB_I_CGROUPWB for cgroup writeback support is enough. Either the file system allocates its own bdi (e.g. btrfs), in which case it is known to support cgroup writeback, or the bdi comes from the block layer, which always supports cgroup writeback. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Jan Kara <jack@suse.cz> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Jens Axboe <axboe@kernel.dk> |
|
|
|
55b2598e84 |
bdi: initialize ->ra_pages and ->io_pages in bdi_init
Set up a readahead size by default, as very few users have a good reason to change it. This means code, ecryptfs, and orangefs now set up the values while they were previously missing it, while ubifs, mtd and vboxsf manually set it to 0 to avoid readahead. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Jan Kara <jack@suse.cz> Acked-by: David Sterba <dsterba@suse.com> [btrfs] Acked-by: Richard Weinberger <richard@nod.at> [ubifs, mtd] Signed-off-by: Jens Axboe <axboe@kernel.dk> |
|
|
|
bffac4b543 |
for-5.9-rc6-tag
-----BEGIN PGP SIGNATURE----- iQIzBAABCgAdFiEE8rQSAMVO+zA4DBdWxWXV+ddtWDsFAl9q8PUACgkQxWXV+ddt WDsHZg//YF3Rfeo7/zaRsfUPvNoKDcM69TW+HROJXu4+rYlOukyuh5T+wboRU1Ft 7ymiR18idPYbtOczmH1Pqw+3wyOr39WafcvAnndoUguXJHsUrriBNqkthQICt0CG hUUiofedaB+j+ti7AYGhF/tkqjd8LkCj8SGEz4cSUFCheIHR+ajFwFmx1Sw6NGJV h9SdKfbBpIqIpoExFhprNFlxdaKN9rlhYY+zXZYeCBdU6r89CkuLqxZ79GzaU0N7 PG7FxuuJXvyHhta2a6p8hnEp7perOG22OTXJhzXd5JXiNCfZ/w4SfhH/aPO/3t5V x42hO+FvloVSLS3woZqkBsCgCIe0a3QOT0YxZiM+1cwSgg8mVw4UBEB3PIgkfOVT LawMbcgSh1evsSazru8gujm4f8RVxpSxxWfhhRwjXtyB8K89e22yBa9Lwfj04SH7 O5O7VrLDDnHsQWinsEf4Rl6byA13jUCgI5eUxZ5B7Au0Pm9uMexDh3lvgE0W0ucY UvD8qAetu2NNZD68gZp597uHPrwu+Lr+VumIh4wF6doeShlIkbf/d+ntOgW9ey1S WFSh7sUdKg5pVf6KJQ4yc3aBA6un5lv9LnvPJOwc9HyMUj/cYuxywxWf1YMr5umv 7/6CkufYjTAmEERQeqE1I6UIgUiWkS9nIisB8BLbYkrMR9Wi1bs= =tFUE -----END PGP SIGNATURE----- Merge tag 'for-5.9-rc6-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux Pull btrfs fixes from David Sterba: "syzkaller started to hit us with reports, here's a fix for one type (stack overflow when printing checksums on read error). The other patch is a fix for sysfs object, we have a test for that and it leads to a crash." * tag 'for-5.9-rc6-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux: btrfs: fix put of uninitialized kobject after seed device delete btrfs: fix overflow when copying corrupt csums for a message |
|
|
|
b5ddcffa37 |
btrfs: fix put of uninitialized kobject after seed device delete
The following test case leads to NULL kobject free error:
mount seed /mnt
add sprout to /mnt
umount /mnt
mount sprout to /mnt
delete seed
kobject: '(null)' (00000000dd2b87e4): is not initialized, yet kobject_put() is being called.
WARNING: CPU: 1 PID: 15784 at lib/kobject.c:736 kobject_put+0x80/0x350
RIP: 0010:kobject_put+0x80/0x350
::
Call Trace:
btrfs_sysfs_remove_devices_dir+0x6e/0x160 [btrfs]
btrfs_rm_device.cold+0xa8/0x298 [btrfs]
btrfs_ioctl+0x206c/0x22a0 [btrfs]
ksys_ioctl+0xe2/0x140
__x64_sys_ioctl+0x1e/0x29
do_syscall_64+0x96/0x150
entry_SYSCALL_64_after_hwframe+0x44/0xa9
RIP: 0033:0x7f4047c6288b
::
This is because, at the end of the seed device-delete, we try to remove
the seed's devid sysfs entry. But for the seed devices under the sprout
fs, we don't initialize the devid kobject yet. So add a kobject state
check, which takes care of the bug.
Fixes:
|
|
|
|
35be8851d1 |
btrfs: fix overflow when copying corrupt csums for a message
Syzkaller reported a buffer overflow in btree_readpage_end_io_hook()
when loop mounting a crafted image:
detected buffer overflow in memcpy
------------[ cut here ]------------
kernel BUG at lib/string.c:1129!
invalid opcode: 0000 [#1] PREEMPT SMP KASAN
CPU: 1 PID: 26 Comm: kworker/u4:2 Not tainted 5.9.0-rc4-syzkaller #0
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
Workqueue: btrfs-endio-meta btrfs_work_helper
RIP: 0010:fortify_panic+0xf/0x20 lib/string.c:1129
RSP: 0018:ffffc90000e27980 EFLAGS: 00010286
RAX: 0000000000000022 RBX: ffff8880a80dca64 RCX: 0000000000000000
RDX: ffff8880a90860c0 RSI: ffffffff815dba07 RDI: fffff520001c4f22
RBP: ffff8880a80dca00 R08: 0000000000000022 R09: ffff8880ae7318e7
R10: 0000000000000000 R11: 0000000000077578 R12: 00000000ffffff6e
R13: 0000000000000008 R14: ffffc90000e27a40 R15: 1ffff920001c4f3c
FS: 0000000000000000(0000) GS:ffff8880ae700000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000557335f440d0 CR3: 000000009647d000 CR4: 00000000001506e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Call Trace:
memcpy include/linux/string.h:405 [inline]
btree_readpage_end_io_hook.cold+0x206/0x221 fs/btrfs/disk-io.c:642
end_bio_extent_readpage+0x4de/0x10c0 fs/btrfs/extent_io.c:2854
bio_endio+0x3cf/0x7f0 block/bio.c:1449
end_workqueue_fn+0x114/0x170 fs/btrfs/disk-io.c:1695
btrfs_work_helper+0x221/0xe20 fs/btrfs/async-thread.c:318
process_one_work+0x94c/0x1670 kernel/workqueue.c:2269
worker_thread+0x64c/0x1120 kernel/workqueue.c:2415
kthread+0x3b5/0x4a0 kernel/kthread.c:292
ret_from_fork+0x1f/0x30 arch/x86/entry/entry_64.S:294
Modules linked in:
---[ end trace b68924293169feef ]---
RIP: 0010:fortify_panic+0xf/0x20 lib/string.c:1129
RSP: 0018:ffffc90000e27980 EFLAGS: 00010286
RAX: 0000000000000022 RBX: ffff8880a80dca64 RCX: 0000000000000000
RDX: ffff8880a90860c0 RSI: ffffffff815dba07 RDI: fffff520001c4f22
RBP: ffff8880a80dca00 R08: 0000000000000022 R09: ffff8880ae7318e7
R10: 0000000000000000 R11: 0000000000077578 R12: 00000000ffffff6e
R13: 0000000000000008 R14: ffffc90000e27a40 R15: 1ffff920001c4f3c
FS: 0000000000000000(0000) GS:ffff8880ae700000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007f95b7c4d008 CR3: 000000009647d000 CR4: 00000000001506e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
The overflow happens, because in btree_readpage_end_io_hook() we assume
that we have found a 4 byte checksum instead of the real possible 32
bytes we have for the checksums.
With the fix applied:
[ 35.726623] BTRFS: device fsid 815caf9a-dc43-4d2a-ac54-764b8333d765 devid 1 transid 5 /dev/loop0 scanned by syz-repro (215)
[ 35.738994] BTRFS info (device loop0): disk space caching is enabled
[ 35.738998] BTRFS info (device loop0): has skinny extents
[ 35.743337] BTRFS warning (device loop0): loop0 checksum verify failed on 1052672 wanted 0xf9c035fc8d239a54 found 0x67a25c14b7eabcf9 level 0
[ 35.743420] BTRFS error (device loop0): failed to read chunk root
[ 35.745899] BTRFS error (device loop0): open_ctree failed
Reported-by: syzbot+e864a35d361e1d4e29a5@syzkaller.appspotmail.com
Fixes:
|
|
|
|
fc4f28bb3d |
for-5.9-rc5-tag
-----BEGIN PGP SIGNATURE----- iQIzBAABCgAdFiEE8rQSAMVO+zA4DBdWxWXV+ddtWDsFAl9fju4ACgkQxWXV+ddt WDukWw//d3D33s1SnHI2ooub6VllM5xC/VCB5hfpHdeuLfhIR4iFrXlKCwFNuyuF A1CgKu7W4eMkQ9E6dKIvYhRQj9ELcHWxS1LRUkuXVGojvPVQAbGhUibjcmTy6xoi t4KkvdI0t/puEwRTrF2ooInWzdbiJvzCuCHLWkX4wK0IAkaVt+IK2ZJuSTCZhUEv Gx5CSLqP6zq/XD/03yWQvILE3MoM9BOc28EQ345dhEu+SWWpdhzy4W825A8xUe3Q bThcK8vJgeLUo/KVtUJNLoFTl+2BpxzaiHWqh+3plGjQ1qtZyDgFhZOoP0JEJb/f qEnyN2XbY26p6nj2J+34Euz5OX65ot6bePx6v/h5jY+UjNa8mpZD6d4Y+OII0XW0 kCLPyaNc8P6IkmAYzSLQPmrjCoRgR/2we8HRBylTWaeEcvpduZs/5YlJNO9q59Et voclNx6kk4AdFjSxXHwERlSA3v4+pkq41bjnbpv7a9FGRBUJGVfJFCNukuhCx5Oi nUKg7jBJEZNhE2DOA0MKnrZTBnufkLCiH+f/7pn+ypPdw1dna/ix9Suwbz4+UElx TtwGLF99CQCX09ieJBVWB90a3xgQBIZLaqu1HtZ1u7dzKkutl+uKMcnfEEtLSfJy qMLHHsRQj/iwytQHSdWboL6xQCUSkSefPTmmNfstaKSxeq6w0Q0= =pdp0 -----END PGP SIGNATURE----- Merge tag 'for-5.9-rc5-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux Pull btrfs fix from David Sterba: "One of the recent lockdep fixes introduced a bug that breaks the search ioctl, which is used by some applications (bees, compsize). The patch made it to stable trees so we need this fixup to make it work again" * tag 'for-5.9-rc5-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux: btrfs: fix wrong address when faulting in pages in the search ioctl |
|
|
|
1c78544eaa |
btrfs: fix wrong address when faulting in pages in the search ioctl
When faulting in the pages for the user supplied buffer for the search ioctl, we are passing only the base address of the buffer to the function fault_in_pages_writeable(). This means that after the first iteration of the while loop that searches for leaves, when we have a non-zero offset, stored in 'sk_offset', we try to fault in a wrong page range. So fix this by adding the offset in 'sk_offset' to the base address of the user supplied buffer when calling fault_in_pages_writeable(). Several users have reported that the applications compsize and bees have started to operate incorrectly since commit |
|
|
|
edf6b0e1e4 |
for-5.9-rc4-tag
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEE8rQSAMVO+zA4DBdWxWXV+ddtWDsFAl9bj/UACgkQxWXV+ddt
WDs1HhAAgAvJVM2WJuMCjQhQiKljFjRT1a0Kbsp+9ayw5Q225t5S5kCMWrsA6mXF
/9bGRmmELm/Nr5pSH9hp5Bhbke0vNV+Y9XiRQXpegla4LLMF4MulVgADRIL3WoxO
ZAtNmZUokkjvB0CkzDuI7PqrF67TXLqV2hlctZo0p5SAFFgLaELyIYC6uAaO9Qo/
+EAAK+7oJyzWcUp44APu90wBbF79umwNVKEEkDfc6bwiA2Cut1JGzvPWgGvvQnta
fAd114LFViKg05GXcbnx4NxHYtf9tKHjDk9yYWssR+uV6vo/pWwAkDwYxXm/LzA4
Zv8QK5uvng1fW4eq9QkN3KflIDn+YhaH1jgwNcgyS+ZCdqZR1Mi949f+6Nj1fXt2
NeXOx3nhtqgNthKQNvHSMVJZrPjV3bdzOz+bULA+hMvTkr5gJy+ToAs30SLxGF5Y
BCJEE6b5M5Jnb+UHEBMuoxubBfmPHkY8LxfDzVWDLESsKcW2eYyeJyJXx4DNe/v9
O7Z5pcku+7R9LOlYQEzKeSuiYMqYLtmQtcNXyFBysksikjFJBWNgENna1LmgvmRH
j6fC5S9h4sIxzyKQkJgihIDt/a3f9WnhsoHw8EIn62tfdOIvMcT/xWq9YYgWaOjZ
H9040WXvEAFVcDn4cQ22DNgV+toJMpe0pLg6UXe7VtESUtbwMFM=
=JTfF
-----END PGP SIGNATURE-----
Merge tag 'for-5.9-rc4-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux
Pull btrfs fixes from David Sterba:
"A few more fixes:
- regression fix for a crash after failed snapshot creation
- one more lockep fix: use nofs allocation when allocating missing
device
- fix reloc tree leak on degraded mount
- make some extent buffer alignment checks less strict to mount
filesystems created by btrfs-convert"
* tag 'for-5.9-rc4-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
btrfs: fix NULL pointer dereference after failure to create snapshot
btrfs: free data reloc tree on failed mount
btrfs: require only sector size alignment for parent eb bytenr
btrfs: fix lockdep splat in add_missing_dev
|
|
|
|
2d892ccdc1 |
btrfs: fix NULL pointer dereference after failure to create snapshot
When trying to get a new fs root for a snapshot during the transaction
at transaction.c:create_pending_snapshot(), if btrfs_get_new_fs_root()
fails we leave "pending->snap" pointing to an error pointer, and then
later at ioctl.c:create_snapshot() we dereference that pointer, resulting
in a crash:
[12264.614689] BUG: kernel NULL pointer dereference, address: 00000000000007c4
[12264.615650] #PF: supervisor write access in kernel mode
[12264.616487] #PF: error_code(0x0002) - not-present page
[12264.617436] PGD 0 P4D 0
[12264.618328] Oops: 0002 [#1] PREEMPT SMP DEBUG_PAGEALLOC PTI
[12264.619150] CPU: 0 PID: 2310635 Comm: fsstress Tainted: G W 5.9.0-rc3-btrfs-next-67 #1
[12264.619960] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.13.0-0-gf21b5a4aeb02-prebuilt.qemu.org 04/01/2014
[12264.621769] RIP: 0010:btrfs_mksubvol+0x438/0x4a0 [btrfs]
[12264.622528] Code: bc ef ff ff (...)
[12264.624092] RSP: 0018:ffffaa6fc7277cd8 EFLAGS: 00010282
[12264.624669] RAX: 00000000fffffff4 RBX: ffff9d3e8f151a60 RCX: 0000000000000000
[12264.625249] RDX: 0000000000000001 RSI: ffffffff9d56c9be RDI: fffffffffffffff4
[12264.625830] RBP: ffff9d3e8f151b48 R08: 0000000000000000 R09: 0000000000000000
[12264.626413] R10: 0000000000000000 R11: 0000000000000000 R12: 00000000fffffff4
[12264.626994] R13: ffff9d3ede380538 R14: ffff9d3ede380500 R15: ffff9d3f61b2eeb8
[12264.627582] FS: 00007f140d5d8200(0000) GS:ffff9d3fb5e00000(0000) knlGS:0000000000000000
[12264.628176] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[12264.628773] CR2: 00000000000007c4 CR3: 000000020f8e8004 CR4: 00000000003706f0
[12264.629379] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[12264.629994] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[12264.630594] Call Trace:
[12264.631227] btrfs_mksnapshot+0x7b/0xb0 [btrfs]
[12264.631840] __btrfs_ioctl_snap_create+0x16f/0x1a0 [btrfs]
[12264.632458] btrfs_ioctl_snap_create_v2+0xb0/0xf0 [btrfs]
[12264.633078] btrfs_ioctl+0x1864/0x3130 [btrfs]
[12264.633689] ? do_sys_openat2+0x1a7/0x2d0
[12264.634295] ? kmem_cache_free+0x147/0x3a0
[12264.634899] ? __x64_sys_ioctl+0x83/0xb0
[12264.635488] __x64_sys_ioctl+0x83/0xb0
[12264.636058] do_syscall_64+0x33/0x80
[12264.636616] entry_SYSCALL_64_after_hwframe+0x44/0xa9
(gdb) list *(btrfs_mksubvol+0x438)
0x7c7b8 is in btrfs_mksubvol (fs/btrfs/ioctl.c:858).
853 ret = 0;
854 pending_snapshot->anon_dev = 0;
855 fail:
856 /* Prevent double freeing of anon_dev */
857 if (ret && pending_snapshot->snap)
858 pending_snapshot->snap->anon_dev = 0;
859 btrfs_put_root(pending_snapshot->snap);
860 btrfs_subvolume_release_metadata(root, &pending_snapshot->block_rsv);
861 free_pending:
862 if (pending_snapshot->anon_dev)
So fix this by setting "pending->snap" to NULL if we get an error from the
call to btrfs_get_new_fs_root() at transaction.c:create_pending_snapshot().
Fixes:
|
|
|
|
9e3aa80544 |
btrfs: free data reloc tree on failed mount
While testing a weird problem with -o degraded, I noticed I was getting leaked root errors BTRFS warning (device loop0): writable mount is not allowed due to too many missing devices BTRFS error (device loop0): open_ctree failed BTRFS error (device loop0): leaked root -9-0 refcount 1 This is the DATA_RELOC root, which gets read before the other fs roots, but is included in the fs roots radix tree. Handle this by adding a btrfs_drop_and_free_fs_root() on the data reloc root if it exists. This is ok to do here if we fail further up because we will only drop the ref if we delete the root from the radix tree, and all other cleanup won't be duplicated. CC: stable@vger.kernel.org # 5.8+ Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
|
|
|
ea57788eb7 |
btrfs: require only sector size alignment for parent eb bytenr
[BUG] A completely sane converted fs will cause kernel warning at balance time: [ 1557.188633] BTRFS info (device sda7): relocating block group 8162107392 flags data [ 1563.358078] BTRFS info (device sda7): found 11722 extents [ 1563.358277] BTRFS info (device sda7): leaf 7989321728 gen 95 total ptrs 213 free space 3458 owner 2 [ 1563.358280] item 0 key (7984947200 169 0) itemoff 16250 itemsize 33 [ 1563.358281] extent refs 1 gen 90 flags 2 [ 1563.358282] ref#0: tree block backref root 4 [ 1563.358285] item 1 key (7985602560 169 0) itemoff 16217 itemsize 33 [ 1563.358286] extent refs 1 gen 93 flags 258 [ 1563.358287] ref#0: shared block backref parent 7985602560 [ 1563.358288] (parent 7985602560 is NOT ALIGNED to nodesize 16384) [ 1563.358290] item 2 key (7985635328 169 0) itemoff 16184 itemsize 33 ... [ 1563.358995] BTRFS error (device sda7): eb 7989321728 invalid extent inline ref type 182 [ 1563.358996] ------------[ cut here ]------------ [ 1563.359005] WARNING: CPU: 14 PID: 2930 at 0xffffffff9f231766 Then with transaction abort, and obviously failed to balance the fs. [CAUSE] That mentioned inline ref type 182 is completely sane, it's BTRFS_SHARED_BLOCK_REF_KEY, it's some extra check making kernel to believe it's invalid. Commit |
|
|
|
fccc0007b8 |
btrfs: fix lockdep splat in add_missing_dev
Nikolay reported a lockdep splat in generic/476 that I could reproduce with btrfs/187. ====================================================== WARNING: possible circular locking dependency detected 5.9.0-rc2+ #1 Tainted: G W ------------------------------------------------------ kswapd0/100 is trying to acquire lock: ffff9e8ef38b6268 (&delayed_node->mutex){+.+.}-{3:3}, at: __btrfs_release_delayed_node.part.0+0x3f/0x330 but task is already holding lock: ffffffffa9d74700 (fs_reclaim){+.+.}-{0:0}, at: __fs_reclaim_acquire+0x5/0x30 which lock already depends on the new lock. the existing dependency chain (in reverse order) is: -> #2 (fs_reclaim){+.+.}-{0:0}: fs_reclaim_acquire+0x65/0x80 slab_pre_alloc_hook.constprop.0+0x20/0x200 kmem_cache_alloc_trace+0x3a/0x1a0 btrfs_alloc_device+0x43/0x210 add_missing_dev+0x20/0x90 read_one_chunk+0x301/0x430 btrfs_read_sys_array+0x17b/0x1b0 open_ctree+0xa62/0x1896 btrfs_mount_root.cold+0x12/0xea legacy_get_tree+0x30/0x50 vfs_get_tree+0x28/0xc0 vfs_kern_mount.part.0+0x71/0xb0 btrfs_mount+0x10d/0x379 legacy_get_tree+0x30/0x50 vfs_get_tree+0x28/0xc0 path_mount+0x434/0xc00 __x64_sys_mount+0xe3/0x120 do_syscall_64+0x33/0x40 entry_SYSCALL_64_after_hwframe+0x44/0xa9 -> #1 (&fs_info->chunk_mutex){+.+.}-{3:3}: __mutex_lock+0x7e/0x7e0 btrfs_chunk_alloc+0x125/0x3a0 find_free_extent+0xdf6/0x1210 btrfs_reserve_extent+0xb3/0x1b0 btrfs_alloc_tree_block+0xb0/0x310 alloc_tree_block_no_bg_flush+0x4a/0x60 __btrfs_cow_block+0x11a/0x530 btrfs_cow_block+0x104/0x220 btrfs_search_slot+0x52e/0x9d0 btrfs_lookup_inode+0x2a/0x8f __btrfs_update_delayed_inode+0x80/0x240 btrfs_commit_inode_delayed_inode+0x119/0x120 btrfs_evict_inode+0x357/0x500 evict+0xcf/0x1f0 vfs_rmdir.part.0+0x149/0x160 do_rmdir+0x136/0x1a0 do_syscall_64+0x33/0x40 entry_SYSCALL_64_after_hwframe+0x44/0xa9 -> #0 (&delayed_node->mutex){+.+.}-{3:3}: __lock_acquire+0x1184/0x1fa0 lock_acquire+0xa4/0x3d0 __mutex_lock+0x7e/0x7e0 __btrfs_release_delayed_node.part.0+0x3f/0x330 btrfs_evict_inode+0x24c/0x500 evict+0xcf/0x1f0 dispose_list+0x48/0x70 prune_icache_sb+0x44/0x50 super_cache_scan+0x161/0x1e0 do_shrink_slab+0x178/0x3c0 shrink_slab+0x17c/0x290 shrink_node+0x2b2/0x6d0 balance_pgdat+0x30a/0x670 kswapd+0x213/0x4c0 kthread+0x138/0x160 ret_from_fork+0x1f/0x30 other info that might help us debug this: Chain exists of: &delayed_node->mutex --> &fs_info->chunk_mutex --> fs_reclaim Possible unsafe locking scenario: CPU0 CPU1 ---- ---- lock(fs_reclaim); lock(&fs_info->chunk_mutex); lock(fs_reclaim); lock(&delayed_node->mutex); *** DEADLOCK *** 3 locks held by kswapd0/100: #0: ffffffffa9d74700 (fs_reclaim){+.+.}-{0:0}, at: __fs_reclaim_acquire+0x5/0x30 #1: ffffffffa9d65c50 (shrinker_rwsem){++++}-{3:3}, at: shrink_slab+0x115/0x290 #2: ffff9e8e9da260e0 (&type->s_umount_key#48){++++}-{3:3}, at: super_cache_scan+0x38/0x1e0 stack backtrace: CPU: 1 PID: 100 Comm: kswapd0 Tainted: G W 5.9.0-rc2+ #1 Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.13.0-2.fc32 04/01/2014 Call Trace: dump_stack+0x92/0xc8 check_noncircular+0x12d/0x150 __lock_acquire+0x1184/0x1fa0 lock_acquire+0xa4/0x3d0 ? __btrfs_release_delayed_node.part.0+0x3f/0x330 __mutex_lock+0x7e/0x7e0 ? __btrfs_release_delayed_node.part.0+0x3f/0x330 ? __btrfs_release_delayed_node.part.0+0x3f/0x330 ? lock_acquire+0xa4/0x3d0 ? btrfs_evict_inode+0x11e/0x500 ? find_held_lock+0x2b/0x80 __btrfs_release_delayed_node.part.0+0x3f/0x330 btrfs_evict_inode+0x24c/0x500 evict+0xcf/0x1f0 dispose_list+0x48/0x70 prune_icache_sb+0x44/0x50 super_cache_scan+0x161/0x1e0 do_shrink_slab+0x178/0x3c0 shrink_slab+0x17c/0x290 shrink_node+0x2b2/0x6d0 balance_pgdat+0x30a/0x670 kswapd+0x213/0x4c0 ? _raw_spin_unlock_irqrestore+0x46/0x60 ? add_wait_queue_exclusive+0x70/0x70 ? balance_pgdat+0x670/0x670 kthread+0x138/0x160 ? kthread_create_worker_on_cpu+0x40/0x40 ret_from_fork+0x1f/0x30 This is because we are holding the chunk_mutex when we call btrfs_alloc_device, which does a GFP_KERNEL allocation. We don't want to switch that to a GFP_NOFS lock because this is the only place where it matters. So instead use memalloc_nofs_save() around the allocation in order to avoid the lockdep splat. Reported-by: Nikolay Borisov <nborisov@suse.com> CC: stable@vger.kernel.org # 4.4+ Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
|
|
|
dcdfd9cc28 |
for-5.9-rc3-tag
-----BEGIN PGP SIGNATURE----- iQIzBAABCgAdFiEE8rQSAMVO+zA4DBdWxWXV+ddtWDsFAl9NWskACgkQxWXV+ddt WDsO5A//Xcjo45Th0T0mTiPXWpCPexOmQd3yvEezME5TVavEKM9hkfxYrv5d5i/k dfFhPg8FYSasW7Sie1PdcP0Hu9mlla5G/4L1pQYMewCdJCSx0mt/dDtEh4GPwcmI 1MgyQ0pnKufNBD9SquO4yGZHR+lP+wY/FTZ6yCePJRvNEvtOWtcxH3vhTrbKC0sc wSO2xL8Ht3VQBDA6YR+A/OMY8t6k2aN0qOk1BJWcMil0HmNw/rZBH/bWvPOn0dMQ vaXQIUNRIUUD5qS7vcv5rgVDDWMnyM41eC6idckg1A2tpLGYZyFO3fv2c/VNKFaZ dWQGeSbg0wRMnQhHhDmAZZwRfTsOhGuSPADnh4qQ07klDto6s5EbmrwkWmGNC6QY B9yGF09mNwcHvgjz6RVv3mA/dINP32Qg3s90TtO2b7Rrolg1sQLRYDewuTUukmvH jIg8q1oZBppHS8y5B4Vxl+18Swz4j66Kurq9jhU669n0CUqYBiqg5dEQzDHm54Ca uEujNWfUGzfiFbj3uVtMG0i9XSHlwEtiGOCkXErlEqAFRSzPQsoFJQ0ae3Hl9FiI PJuTVmS1UekhCB/ubewf5dOHbEk/nRKbIHnXbuJTQcWWzGtzzyBv4uJH3arQcMrC QdiBpbqR1BSQ6ki4GbnG9f/4mvnLffsfPDAh+dt9VGxEru/9Q/c= =cdvW -----END PGP SIGNATURE----- Merge tag 'for-5.9-rc3-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux Pull btrfs fixes from David Sterba: "Two small fixes and a bunch of lockdep fixes for warnings that show up with an upcoming tree locking update but are valid with current locks as well" * tag 'for-5.9-rc3-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux: btrfs: tree-checker: fix the error message for transid error btrfs: set the lockdep class for log tree extent buffers btrfs: set the correct lockdep class for new nodes btrfs: allocate scrub workqueues outside of locks btrfs: fix potential deadlock in the search ioctl btrfs: drop path before adding new uuid tree entry btrfs: block-group: fix free-space bitmap threshold |
|
|
|
f96d6960ab |
btrfs: tree-checker: fix the error message for transid error
The error message for inode transid is the same as for inode generation,
which makes us unable to detect the real problem.
Reported-by: Tyler Richmond <t.d.richmond@gmail.com>
Fixes:
|
|
|
|
d3beaa253f |
btrfs: set the lockdep class for log tree extent buffers
These are special extent buffers that get rewound in order to lookup the state of the tree at a specific point in time. As such they do not go through the normal initialization paths that set their lockdep class, so handle them appropriately when they are created and before they are locked. CC: stable@vger.kernel.org # 4.4+ Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
|
|
|
ad24466588 |
btrfs: set the correct lockdep class for new nodes
When flipping over to the rw_semaphore I noticed I'd get a lockdep splat in replace_path(), which is weird because we're swapping the reloc root with the actual target root. Turns out this is because we're using the root->root_key.objectid as the root id for the newly allocated tree block when setting the lockdep class, however we need to be using the actual owner of this new block, which is saved in owner. The affected path is through btrfs_copy_root as all other callers of btrfs_alloc_tree_block (which calls init_new_buffer) have root_objectid == root->root_key.objectid . CC: stable@vger.kernel.org # 5.4+ Reviewed-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
|
|
|
e89c4a9c8e |
btrfs: allocate scrub workqueues outside of locks
I got the following lockdep splat while testing: ====================================================== WARNING: possible circular locking dependency detected 5.8.0-rc7-00172-g021118712e59 #932 Not tainted ------------------------------------------------------ btrfs/229626 is trying to acquire lock: ffffffff828513f0 (cpu_hotplug_lock){++++}-{0:0}, at: alloc_workqueue+0x378/0x450 but task is already holding lock: ffff889dd3889518 (&fs_info->scrub_lock){+.+.}-{3:3}, at: btrfs_scrub_dev+0x11c/0x630 which lock already depends on the new lock. the existing dependency chain (in reverse order) is: -> #7 (&fs_info->scrub_lock){+.+.}-{3:3}: __mutex_lock+0x9f/0x930 btrfs_scrub_dev+0x11c/0x630 btrfs_dev_replace_by_ioctl.cold.21+0x10a/0x1d4 btrfs_ioctl+0x2799/0x30a0 ksys_ioctl+0x83/0xc0 __x64_sys_ioctl+0x16/0x20 do_syscall_64+0x50/0x90 entry_SYSCALL_64_after_hwframe+0x44/0xa9 -> #6 (&fs_devs->device_list_mutex){+.+.}-{3:3}: __mutex_lock+0x9f/0x930 btrfs_run_dev_stats+0x49/0x480 commit_cowonly_roots+0xb5/0x2a0 btrfs_commit_transaction+0x516/0xa60 sync_filesystem+0x6b/0x90 generic_shutdown_super+0x22/0x100 kill_anon_super+0xe/0x30 btrfs_kill_super+0x12/0x20 deactivate_locked_super+0x29/0x60 cleanup_mnt+0xb8/0x140 task_work_run+0x6d/0xb0 __prepare_exit_to_usermode+0x1cc/0x1e0 do_syscall_64+0x5c/0x90 entry_SYSCALL_64_after_hwframe+0x44/0xa9 -> #5 (&fs_info->tree_log_mutex){+.+.}-{3:3}: __mutex_lock+0x9f/0x930 btrfs_commit_transaction+0x4bb/0xa60 sync_filesystem+0x6b/0x90 generic_shutdown_super+0x22/0x100 kill_anon_super+0xe/0x30 btrfs_kill_super+0x12/0x20 deactivate_locked_super+0x29/0x60 cleanup_mnt+0xb8/0x140 task_work_run+0x6d/0xb0 __prepare_exit_to_usermode+0x1cc/0x1e0 do_syscall_64+0x5c/0x90 entry_SYSCALL_64_after_hwframe+0x44/0xa9 -> #4 (&fs_info->reloc_mutex){+.+.}-{3:3}: __mutex_lock+0x9f/0x930 btrfs_record_root_in_trans+0x43/0x70 start_transaction+0xd1/0x5d0 btrfs_dirty_inode+0x42/0xd0 touch_atime+0xa1/0xd0 btrfs_file_mmap+0x3f/0x60 mmap_region+0x3a4/0x640 do_mmap+0x376/0x580 vm_mmap_pgoff+0xd5/0x120 ksys_mmap_pgoff+0x193/0x230 do_syscall_64+0x50/0x90 entry_SYSCALL_64_after_hwframe+0x44/0xa9 -> #3 (&mm->mmap_lock#2){++++}-{3:3}: __might_fault+0x68/0x90 _copy_to_user+0x1e/0x80 perf_read+0x141/0x2c0 vfs_read+0xad/0x1b0 ksys_read+0x5f/0xe0 do_syscall_64+0x50/0x90 entry_SYSCALL_64_after_hwframe+0x44/0xa9 -> #2 (&cpuctx_mutex){+.+.}-{3:3}: __mutex_lock+0x9f/0x930 perf_event_init_cpu+0x88/0x150 perf_event_init+0x1db/0x20b start_kernel+0x3ae/0x53c secondary_startup_64+0xa4/0xb0 -> #1 (pmus_lock){+.+.}-{3:3}: __mutex_lock+0x9f/0x930 perf_event_init_cpu+0x4f/0x150 cpuhp_invoke_callback+0xb1/0x900 _cpu_up.constprop.26+0x9f/0x130 cpu_up+0x7b/0xc0 bringup_nonboot_cpus+0x4f/0x60 smp_init+0x26/0x71 kernel_init_freeable+0x110/0x258 kernel_init+0xa/0x103 ret_from_fork+0x1f/0x30 -> #0 (cpu_hotplug_lock){++++}-{0:0}: __lock_acquire+0x1272/0x2310 lock_acquire+0x9e/0x360 cpus_read_lock+0x39/0xb0 alloc_workqueue+0x378/0x450 __btrfs_alloc_workqueue+0x15d/0x200 btrfs_alloc_workqueue+0x51/0x160 scrub_workers_get+0x5a/0x170 btrfs_scrub_dev+0x18c/0x630 btrfs_dev_replace_by_ioctl.cold.21+0x10a/0x1d4 btrfs_ioctl+0x2799/0x30a0 ksys_ioctl+0x83/0xc0 __x64_sys_ioctl+0x16/0x20 do_syscall_64+0x50/0x90 entry_SYSCALL_64_after_hwframe+0x44/0xa9 other info that might help us debug this: Chain exists of: cpu_hotplug_lock --> &fs_devs->device_list_mutex --> &fs_info->scrub_lock Possible unsafe locking scenario: CPU0 CPU1 ---- ---- lock(&fs_info->scrub_lock); lock(&fs_devs->device_list_mutex); lock(&fs_info->scrub_lock); lock(cpu_hotplug_lock); *** DEADLOCK *** 2 locks held by btrfs/229626: #0: ffff88bfe8bb86e0 (&fs_devs->device_list_mutex){+.+.}-{3:3}, at: btrfs_scrub_dev+0xbd/0x630 #1: ffff889dd3889518 (&fs_info->scrub_lock){+.+.}-{3:3}, at: btrfs_scrub_dev+0x11c/0x630 stack backtrace: CPU: 15 PID: 229626 Comm: btrfs Kdump: loaded Not tainted 5.8.0-rc7-00172-g021118712e59 #932 Hardware name: Quanta Tioga Pass Single Side 01-0030993006/Tioga Pass Single Side, BIOS F08_3A18 12/20/2018 Call Trace: dump_stack+0x78/0xa0 check_noncircular+0x165/0x180 __lock_acquire+0x1272/0x2310 lock_acquire+0x9e/0x360 ? alloc_workqueue+0x378/0x450 cpus_read_lock+0x39/0xb0 ? alloc_workqueue+0x378/0x450 alloc_workqueue+0x378/0x450 ? rcu_read_lock_sched_held+0x52/0x80 __btrfs_alloc_workqueue+0x15d/0x200 btrfs_alloc_workqueue+0x51/0x160 scrub_workers_get+0x5a/0x170 btrfs_scrub_dev+0x18c/0x630 ? start_transaction+0xd1/0x5d0 btrfs_dev_replace_by_ioctl.cold.21+0x10a/0x1d4 btrfs_ioctl+0x2799/0x30a0 ? do_sigaction+0x102/0x250 ? lockdep_hardirqs_on_prepare+0xca/0x160 ? _raw_spin_unlock_irq+0x24/0x30 ? trace_hardirqs_on+0x1c/0xe0 ? _raw_spin_unlock_irq+0x24/0x30 ? do_sigaction+0x102/0x250 ? ksys_ioctl+0x83/0xc0 ksys_ioctl+0x83/0xc0 __x64_sys_ioctl+0x16/0x20 do_syscall_64+0x50/0x90 entry_SYSCALL_64_after_hwframe+0x44/0xa9 This happens because we're allocating the scrub workqueues under the scrub and device list mutex, which brings in a whole host of other dependencies. Because the work queue allocation is done with GFP_KERNEL, it can trigger reclaim, which can lead to a transaction commit, which in turns needs the device_list_mutex, it can lead to a deadlock. A different problem for which this fix is a solution. Fix this by moving the actual allocation outside of the scrub lock, and then only take the lock once we're ready to actually assign them to the fs_info. We'll now have to cleanup the workqueues in a few more places, so I've added a helper to do the refcount dance to safely free the workqueues. CC: stable@vger.kernel.org # 5.4+ Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
|
|
|
a48b73eca4 |
btrfs: fix potential deadlock in the search ioctl
With the conversion of the tree locks to rwsem I got the following lockdep splat: ====================================================== WARNING: possible circular locking dependency detected 5.8.0-rc7-00165-g04ec4da5f45f-dirty #922 Not tainted ------------------------------------------------------ compsize/11122 is trying to acquire lock: ffff889fabca8768 (&mm->mmap_lock#2){++++}-{3:3}, at: __might_fault+0x3e/0x90 but task is already holding lock: ffff889fe720fe40 (btrfs-fs-00){++++}-{3:3}, at: __btrfs_tree_read_lock+0x39/0x180 which lock already depends on the new lock. the existing dependency chain (in reverse order) is: -> #2 (btrfs-fs-00){++++}-{3:3}: down_write_nested+0x3b/0x70 __btrfs_tree_lock+0x24/0x120 btrfs_search_slot+0x756/0x990 btrfs_lookup_inode+0x3a/0xb4 __btrfs_update_delayed_inode+0x93/0x270 btrfs_async_run_delayed_root+0x168/0x230 btrfs_work_helper+0xd4/0x570 process_one_work+0x2ad/0x5f0 worker_thread+0x3a/0x3d0 kthread+0x133/0x150 ret_from_fork+0x1f/0x30 -> #1 (&delayed_node->mutex){+.+.}-{3:3}: __mutex_lock+0x9f/0x930 btrfs_delayed_update_inode+0x50/0x440 btrfs_update_inode+0x8a/0xf0 btrfs_dirty_inode+0x5b/0xd0 touch_atime+0xa1/0xd0 btrfs_file_mmap+0x3f/0x60 mmap_region+0x3a4/0x640 do_mmap+0x376/0x580 vm_mmap_pgoff+0xd5/0x120 ksys_mmap_pgoff+0x193/0x230 do_syscall_64+0x50/0x90 entry_SYSCALL_64_after_hwframe+0x44/0xa9 -> #0 (&mm->mmap_lock#2){++++}-{3:3}: __lock_acquire+0x1272/0x2310 lock_acquire+0x9e/0x360 __might_fault+0x68/0x90 _copy_to_user+0x1e/0x80 copy_to_sk.isra.32+0x121/0x300 search_ioctl+0x106/0x200 btrfs_ioctl_tree_search_v2+0x7b/0xf0 btrfs_ioctl+0x106f/0x30a0 ksys_ioctl+0x83/0xc0 __x64_sys_ioctl+0x16/0x20 do_syscall_64+0x50/0x90 entry_SYSCALL_64_after_hwframe+0x44/0xa9 other info that might help us debug this: Chain exists of: &mm->mmap_lock#2 --> &delayed_node->mutex --> btrfs-fs-00 Possible unsafe locking scenario: CPU0 CPU1 ---- ---- lock(btrfs-fs-00); lock(&delayed_node->mutex); lock(btrfs-fs-00); lock(&mm->mmap_lock#2); *** DEADLOCK *** 1 lock held by compsize/11122: #0: ffff889fe720fe40 (btrfs-fs-00){++++}-{3:3}, at: __btrfs_tree_read_lock+0x39/0x180 stack backtrace: CPU: 17 PID: 11122 Comm: compsize Kdump: loaded Not tainted 5.8.0-rc7-00165-g04ec4da5f45f-dirty #922 Hardware name: Quanta Tioga Pass Single Side 01-0030993006/Tioga Pass Single Side, BIOS F08_3A18 12/20/2018 Call Trace: dump_stack+0x78/0xa0 check_noncircular+0x165/0x180 __lock_acquire+0x1272/0x2310 lock_acquire+0x9e/0x360 ? __might_fault+0x3e/0x90 ? find_held_lock+0x72/0x90 __might_fault+0x68/0x90 ? __might_fault+0x3e/0x90 _copy_to_user+0x1e/0x80 copy_to_sk.isra.32+0x121/0x300 ? btrfs_search_forward+0x2a6/0x360 search_ioctl+0x106/0x200 btrfs_ioctl_tree_search_v2+0x7b/0xf0 btrfs_ioctl+0x106f/0x30a0 ? __do_sys_newfstat+0x5a/0x70 ? ksys_ioctl+0x83/0xc0 ksys_ioctl+0x83/0xc0 __x64_sys_ioctl+0x16/0x20 do_syscall_64+0x50/0x90 entry_SYSCALL_64_after_hwframe+0x44/0xa9 The problem is we're doing a copy_to_user() while holding tree locks, which can deadlock if we have to do a page fault for the copy_to_user(). This exists even without my locking changes, so it needs to be fixed. Rework the search ioctl to do the pre-fault and then copy_to_user_nofault for the copying. CC: stable@vger.kernel.org # 4.4+ Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
|
|
|
9771a5cf93 |
btrfs: drop path before adding new uuid tree entry
With the conversion of the tree locks to rwsem I got the following lockdep splat: ====================================================== WARNING: possible circular locking dependency detected 5.8.0-rc7-00167-g0d7ba0c5b375-dirty #925 Not tainted ------------------------------------------------------ btrfs-uuid/7955 is trying to acquire lock: ffff88bfbafec0f8 (btrfs-root-00){++++}-{3:3}, at: __btrfs_tree_read_lock+0x39/0x180 but task is already holding lock: ffff88bfbafef2a8 (btrfs-uuid-00){++++}-{3:3}, at: __btrfs_tree_read_lock+0x39/0x180 which lock already depends on the new lock. the existing dependency chain (in reverse order) is: -> #1 (btrfs-uuid-00){++++}-{3:3}: down_read_nested+0x3e/0x140 __btrfs_tree_read_lock+0x39/0x180 __btrfs_read_lock_root_node+0x3a/0x50 btrfs_search_slot+0x4bd/0x990 btrfs_uuid_tree_add+0x89/0x2d0 btrfs_uuid_scan_kthread+0x330/0x390 kthread+0x133/0x150 ret_from_fork+0x1f/0x30 -> #0 (btrfs-root-00){++++}-{3:3}: __lock_acquire+0x1272/0x2310 lock_acquire+0x9e/0x360 down_read_nested+0x3e/0x140 __btrfs_tree_read_lock+0x39/0x180 __btrfs_read_lock_root_node+0x3a/0x50 btrfs_search_slot+0x4bd/0x990 btrfs_find_root+0x45/0x1b0 btrfs_read_tree_root+0x61/0x100 btrfs_get_root_ref.part.50+0x143/0x630 btrfs_uuid_tree_iterate+0x207/0x314 btrfs_uuid_rescan_kthread+0x12/0x50 kthread+0x133/0x150 ret_from_fork+0x1f/0x30 other info that might help us debug this: Possible unsafe locking scenario: CPU0 CPU1 ---- ---- lock(btrfs-uuid-00); lock(btrfs-root-00); lock(btrfs-uuid-00); lock(btrfs-root-00); *** DEADLOCK *** 1 lock held by btrfs-uuid/7955: #0: ffff88bfbafef2a8 (btrfs-uuid-00){++++}-{3:3}, at: __btrfs_tree_read_lock+0x39/0x180 stack backtrace: CPU: 73 PID: 7955 Comm: btrfs-uuid Kdump: loaded Not tainted 5.8.0-rc7-00167-g0d7ba0c5b375-dirty #925 Hardware name: Quanta Tioga Pass Single Side 01-0030993006/Tioga Pass Single Side, BIOS F08_3A18 12/20/2018 Call Trace: dump_stack+0x78/0xa0 check_noncircular+0x165/0x180 __lock_acquire+0x1272/0x2310 lock_acquire+0x9e/0x360 ? __btrfs_tree_read_lock+0x39/0x180 ? btrfs_root_node+0x1c/0x1d0 down_read_nested+0x3e/0x140 ? __btrfs_tree_read_lock+0x39/0x180 __btrfs_tree_read_lock+0x39/0x180 __btrfs_read_lock_root_node+0x3a/0x50 btrfs_search_slot+0x4bd/0x990 btrfs_find_root+0x45/0x1b0 btrfs_read_tree_root+0x61/0x100 btrfs_get_root_ref.part.50+0x143/0x630 btrfs_uuid_tree_iterate+0x207/0x314 ? btree_readpage+0x20/0x20 btrfs_uuid_rescan_kthread+0x12/0x50 kthread+0x133/0x150 ? kthread_create_on_node+0x60/0x60 ret_from_fork+0x1f/0x30 This problem exists because we have two different rescan threads, btrfs_uuid_scan_kthread which creates the uuid tree, and btrfs_uuid_tree_iterate that goes through and updates or deletes any out of date roots. The problem is they both do things in different order. btrfs_uuid_scan_kthread() reads the tree_root, and then inserts entries into the uuid_root. btrfs_uuid_tree_iterate() scans the uuid_root, but then does a btrfs_get_fs_root() which can read from the tree_root. It's actually easy enough to not be holding the path in btrfs_uuid_scan_kthread() when we add a uuid entry, as we already drop it further down and re-start the search when we loop. So simply move the path release before we add our entry to the uuid tree. This also fixes a problem where we're holding a path open after we do btrfs_end_transaction(), which has it's own problems. CC: stable@vger.kernel.org # 4.4+ Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
|
|
|
e3e39c72b9 |
btrfs: block-group: fix free-space bitmap threshold
[BUG] After commit |
|
|
|
9907ab3714 |
for-5.9-rc2-tag
-----BEGIN PGP SIGNATURE----- iQIzBAABCgAdFiEE8rQSAMVO+zA4DBdWxWXV+ddtWDsFAl9D5EkACgkQxWXV+ddt WDto+g/6A/2QzxhgOmqqHTiDvn3DkL60XfjB6lmq3NEvinrST+VH20EoX/EuX2Kn u2+gMiWrgBUwlvERkoSasxdJf/6dCCc+9zYDjjKkAxCckENT85Np71o3iEc7Z5z+ LFgS26mt6aYlCCHyIsHutzHtK2MKiUz7/oaUYZMJBHHkKS/5hL1mzIbwiWAqfU2H q0iMz9L2mjp1kZnpwa/yhg/NJ/oGZsKm3UPGDhdc0RlCWHBbDXHFk1wvNRo/yKQW l+yy0dh6PAZ45pRL0/WZwvOzcAglb+uSmwa64UOvwio4Na9P7oAcBzTFmtbBtvP4 WBrOUPCTzkvgQcmoAsWFpD4nrzgW4oS71EICTOIRlPx7A86TP3wYpFEygUlLCoZC Pd4e9mPClmW78hcRT12eJeGcJIzgoKWhR8597jNUEYz3R5T2wKHOcNnq9a1E1PLv zR+5MFShsylUHd7HbMC1O86XnfXe5esegNQMvx36kTS+cR9Dyt5EWMNIAYK4BPM3 /tXWZRqlZPOh3T7DZ4QR5oSSDDNq7ROTdv9jmsleno+woG0MNDYsA7jCbeJnGTmI CtTUP+p41otyM2lFZjV8PG/XyXDKb3UfU5gcsDOZdGP5S0tkyBiKSA6eqhz6DaTi fQOLGZdkNpNN/burbMq7d7YEHr3F6LC17U3L4k5V4MTAm2lp7ZQ= =ONgI -----END PGP SIGNATURE----- Merge tag 'for-5.9-rc2-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux Pull btrfs fixes from David Sterba: - fix swapfile activation on subvolumes with deleted snapshots - error value mixup when removing directory entries from tree log - fix lzo compression level reset after previous level setting - fix space cache memory leak after transaction abort - fix const function attribute - more error handling improvements * tag 'for-5.9-rc2-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux: btrfs: detect nocow for swap after snapshot delete btrfs: check the right error variable in btrfs_del_dir_entries_in_log btrfs: fix space cache memory leak after transaction abort btrfs: use the correct const function attribute for btrfs_get_num_csums btrfs: reset compression level for lzo on remount btrfs: handle errors from async submission |
|
|
|
a84d5d429f |
btrfs: detect nocow for swap after snapshot delete
can_nocow_extent and btrfs_cross_ref_exist both rely on a heuristic for detecting a must cow condition which is not exactly accurate, but saves unnecessary tree traversal. The incorrect assumption is that if the extent was created in a generation smaller than the last snapshot generation, it must be referenced by that snapshot. That is true, except the snapshot could have since been deleted, without affecting the last snapshot generation. The original patch claimed a performance win from this check, but it also leads to a bug where you are unable to use a swapfile if you ever snapshotted the subvolume it's in. Make the check slower and more strict for the swapon case, without modifying the general cow checks as a compromise. Turning swap on does not seem to be a particularly performance sensitive operation, so incurring a possibly unnecessary btrfs_search_slot seems worthwhile for the added usability. Note: Until the snapshot is competely cleaned after deletion, check_committed_refs will still cause the logic to think that cow is necessary, so the user must until 'btrfs subvolu sync' finished before activating the swapfile swapon. CC: stable@vger.kernel.org # 5.4+ Suggested-by: Omar Sandoval <osandov@osandov.com> Signed-off-by: Boris Burkov <boris@bur.io> Signed-off-by: David Sterba <dsterba@suse.com> |
|
|
|
fb2fecbad5 |
btrfs: check the right error variable in btrfs_del_dir_entries_in_log
With my new locking code dbench is so much faster that I tripped over a
transaction abort from ENOSPC. This turned out to be because
btrfs_del_dir_entries_in_log was checking for ret == -ENOSPC, but this
function sets err on error, and returns err. So instead of properly
marking the inode as needing a full commit, we were returning -ENOSPC
and aborting in __btrfs_unlink_inode. Fix this by checking the proper
variable so that we return the correct thing in the case of ENOSPC.
The ENOENT needs to be checked, because btrfs_lookup_dir_item_index()
can return -ENOENT if the dir item isn't in the tree log (which would
happen if we hadn't fsync'ed this guy). We actually handle that case in
__btrfs_unlink_inode, so it's an expected error to get back.
Fixes:
|
|
|
|
bbc37d6e47 |
btrfs: fix space cache memory leak after transaction abort
If a transaction aborts it can cause a memory leak of the pages array of
a block group's io_ctl structure. The following steps explain how that can
happen:
1) Transaction N is committing, currently in state TRANS_STATE_UNBLOCKED
and it's about to start writing out dirty extent buffers;
2) Transaction N + 1 already started and another task, task A, just called
btrfs_commit_transaction() on it;
3) Block group B was dirtied (extents allocated from it) by transaction
N + 1, so when task A calls btrfs_start_dirty_block_groups(), at the
very beginning of the transaction commit, it starts writeback for the
block group's space cache by calling btrfs_write_out_cache(), which
allocates the pages array for the block group's io_ctl with a call to
io_ctl_init(). Block group A is added to the io_list of transaction
N + 1 by btrfs_start_dirty_block_groups();
4) While transaction N's commit is writing out the extent buffers, it gets
an IO error and aborts transaction N, also setting the file system to
RO mode;
5) Task A has already returned from btrfs_start_dirty_block_groups(), is at
btrfs_commit_transaction() and has set transaction N + 1 state to
TRANS_STATE_COMMIT_START. Immediately after that it checks that the
filesystem was turned to RO mode, due to transaction N's abort, and
jumps to the "cleanup_transaction" label. After that we end up at
btrfs_cleanup_one_transaction() which calls btrfs_cleanup_dirty_bgs().
That helper finds block group B in the transaction's io_list but it
never releases the pages array of the block group's io_ctl, resulting in
a memory leak.
In fact at the point when we are at btrfs_cleanup_dirty_bgs(), the pages
array points to pages that were already released by us at
__btrfs_write_out_cache() through the call to io_ctl_drop_pages(). We end
up freeing the pages array only after waiting for the ordered extent to
complete through btrfs_wait_cache_io(), which calls io_ctl_free() to do
that. But in the transaction abort case we don't wait for the space cache's
ordered extent to complete through a call to btrfs_wait_cache_io(), so
that's why we end up with a memory leak - we wait for the ordered extent
to complete indirectly by shutting down the work queues and waiting for
any jobs in them to complete before returning from close_ctree().
We can solve the leak simply by freeing the pages array right after
releasing the pages (with the call to io_ctl_drop_pages()) at
__btrfs_write_out_cache(), since we will never use it anymore after that
and the pages array points to already released pages at that point, which
is currently not a problem since no one will use it after that, but not a
good practice anyway since it can easily lead to use-after-free issues.
So fix this by freeing the pages array right after releasing the pages at
__btrfs_write_out_cache().
This issue can often be reproduced with test case generic/475 from fstests
and kmemleak can detect it and reports it with the following trace:
unreferenced object 0xffff9bbf009fa600 (size 512):
comm "fsstress", pid 38807, jiffies 4298504428 (age 22.028s)
hex dump (first 32 bytes):
00 a0 7c 4d 3d ed ff ff 40 a0 7c 4d 3d ed ff ff ..|M=...@.|M=...
80 a0 7c 4d 3d ed ff ff c0 a0 7c 4d 3d ed ff ff ..|M=.....|M=...
backtrace:
[<00000000f4b5cfe2>] __kmalloc+0x1a8/0x3e0
[<0000000028665e7f>] io_ctl_init+0xa7/0x120 [btrfs]
[<00000000a1f95b2d>] __btrfs_write_out_cache+0x86/0x4a0 [btrfs]
[<00000000207ea1b0>] btrfs_write_out_cache+0x7f/0xf0 [btrfs]
[<00000000af21f534>] btrfs_start_dirty_block_groups+0x27b/0x580 [btrfs]
[<00000000c3c23d44>] btrfs_commit_transaction+0xa6f/0xe70 [btrfs]
[<000000009588930c>] create_subvol+0x581/0x9a0 [btrfs]
[<000000009ef2fd7f>] btrfs_mksubvol+0x3fb/0x4a0 [btrfs]
[<00000000474e5187>] __btrfs_ioctl_snap_create+0x119/0x1a0 [btrfs]
[<00000000708ee349>] btrfs_ioctl_snap_create_v2+0xb0/0xf0 [btrfs]
[<00000000ea60106f>] btrfs_ioctl+0x12c/0x3130 [btrfs]
[<000000005c923d6d>] __x64_sys_ioctl+0x83/0xb0
[<0000000043ace2c9>] do_syscall_64+0x33/0x80
[<00000000904efbce>] entry_SYSCALL_64_after_hwframe+0x44/0xa9
CC: stable@vger.kernel.org # 4.9+
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
|
|
604997b4a3 |
btrfs: use the correct const function attribute for btrfs_get_num_csums
The build robot reports
compiler: h8300-linux-gcc (GCC) 9.3.0
In file included from fs/btrfs/tests/extent-map-tests.c:8:
>> fs/btrfs/tests/../ctree.h:2166:8: warning: type qualifiers ignored on function return type [-Wignored-qualifiers]
2166 | size_t __const btrfs_get_num_csums(void);
| ^~~~~~~
The function attribute for const does not follow the expected scheme and
in this case is confused with a const type qualifier.
Reported-by: kernel test robot <lkp@intel.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
|
|
282dd7d771 |
btrfs: reset compression level for lzo on remount
Currently a user can set mount "-o compress" which will set the compression algorithm to zlib, and use the default compress level for zlib (3): relatime,compress=zlib:3,space_cache If the user remounts the fs using "-o compress=lzo", then the old compress_level is used: relatime,compress=lzo:3,space_cache But lzo does not expose any tunable compression level. The same happens if we set any compress argument with different level, also with zstd. Fix this by resetting the compress_level when compress=lzo is specified. With the fix applied, lzo is shown without compress level: relatime,compress=lzo,space_cache CC: stable@vger.kernel.org # 4.4+ Signed-off-by: Marcos Paulo de Souza <mpdesouza@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
|
|
|
c965d6402f |
btrfs: handle errors from async submission
Btrfs' async submit mechanism is able to handle errors in the submission path and the meta-data async submit function correctly passes the error code to the caller. In btrfs_submit_bio_start() and btrfs_submit_bio_start_direct_io() we're not handling the errors returned by btrfs_csum_one_bio() correctly though and simply call BUG_ON(). This is unnecessary as the caller of these two functions - run_one_async_start - correctly checks for the return values and sets the status of the async_submit_bio. The actual bio submission will be handled later on by run_one_async_done only if async_submit_bio::status is 0, so the data won't be written if we encountered an error in the checksum process. Simply return the error from btrfs_csum_one_bio() to the async submitters, like it's done in btree_submit_bio_start(). Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
|
|
|
23c2c8c6fa |
for-5.9-tag
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEE8rQSAMVO+zA4DBdWxWXV+ddtWDsFAl81Q0wACgkQxWXV+ddt
WDtbqw/+NeFlvQzsCeQV9PX7RjYf9MFEQIThxo33xDl+ersgcOD8MuPa/hY1hoO0
gOn2eRPcVe/RIPBezRbxX9bnqlfW6N0VnBNLJHypMapB2hR6WFcFt7CAMoXKRmHV
RDM37pA2TNULr8XYrJ0+J5Vy1NWp5HdKzEV6bXfsOSzMSdAVMheXNec93suLEB/g
9QGXX6kaaq0Hcpy7tQQBtm2lbVj8/M3LOUAmYOB/JNCPtsJEB/2EO2b63TB4s2cW
0lpiPehW2m/Pv5GjqQM+iN5fbt9yhKB6lqEEgoHZPgI2tLFyh5WlTWKET7uxqj7G
YBzZjiq1WREEl9KWLYZuthcXPLX2XgJ4gLSlckygi1e4MpPlJ4pa30Bj9OyIEIjP
FOeR0lelRYcjmZrQW4Kana0qq8K0JJzvo2dSqaJBGF9CaveN3BAGQ9ttNhgIIpS5
4kBKlv2SCJ9Anhn8la6bFwlfuR2ggMhDShxIGBQpA1OKf0oJyi2dtavSIbuXwFbd
6KA37cyp4cDK9ycmTN5YxZSndzZSqUEh5Wt4gLk32NeIxhyCX4aTvjQj5KqM1MNw
N/WrTJQ27D6jfi+PBRBmT7U6qEujySXUimJRFTJzk+Px8Q/QMzGAFPCqz6iXv3u3
lX1Ywha9iQ0g2IZVoaq1ZjDDp4xOqIakAjaXez3dFhu3Mq3Kc70=
=7b8U
-----END PGP SIGNATURE-----
Merge tag 'for-5.9-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux
Pull more btrfs updates from David Sterba:
"One minor update, the rest are fixes that have arrived a bit late for
the first batch. There are also some recent fixes for bugs that were
discovered during the merge window and pop up during testing.
User visible change:
- show correct subvolume path in /proc/mounts for bind mounts
Fixes:
- fix compression messages when remounting with different level or
compression algorithm
- tree-log: fix some memory leaks on error handling paths
- restore I_VERSION on remount
- fix return values and error code mixups
- fix umount crash with quotas enabled when removing sysfs files
- fix trim range on a shrunk device"
* tag 'for-5.9-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
btrfs: trim: fix underflow in trim length to prevent access beyond device boundary
btrfs: fix return value mixup in btrfs_get_extent
btrfs: sysfs: fix NULL pointer dereference at btrfs_sysfs_del_qgroups()
btrfs: check correct variable after allocation in btrfs_backref_iter_alloc
btrfs: make sure SB_I_VERSION doesn't get unset by remount
btrfs: fix memory leaks after failure to lookup checksums during inode logging
btrfs: don't show full path of bind mounts in subvol=
btrfs: fix messages after changing compression level by remount
btrfs: only search for left_info if there is no right_info in try_merge_free_space
btrfs: inode: fix NULL pointer dereference if inode doesn't need compression
|
|
|
|
c57dd1f2f6 |
btrfs: trim: fix underflow in trim length to prevent access beyond device boundary
[BUG] The following script can lead to tons of beyond device boundary access: mkfs.btrfs -f $dev -b 10G mount $dev $mnt trimfs $mnt btrfs filesystem resize 1:-1G $mnt trimfs $mnt [CAUSE] Since commit |
|
|
|
881a3a11c2 |
btrfs: fix return value mixup in btrfs_get_extent
btrfs_get_extent() sets variable ret, but out: error path expect error
to be in variable err so the error code is lost.
Fixes:
|
|
|
|
62ab2cc04d |
btrfs: sysfs: fix NULL pointer dereference at btrfs_sysfs_del_qgroups()
[BUG]
Unmounting a btrfs filesystem with quota disabled will cause the
following NULL pointer dereference:
BTRFS info (device dm-5): has skinny extents
BUG: kernel NULL pointer dereference, address: 0000000000000018
#PF: supervisor read access in kernel mode
#PF: error_code(0x0000) - not-present page
CPU: 7 PID: 637 Comm: umount Not tainted 5.8.0-rc7-next-20200731-custom #76
RIP: 0010:kobject_del+0x6/0x20
Call Trace:
btrfs_sysfs_del_qgroups+0xac/0xf0 [btrfs]
btrfs_free_qgroup_config+0x63/0x70 [btrfs]
close_ctree+0x1f5/0x323 [btrfs]
btrfs_put_super+0x15/0x17 [btrfs]
generic_shutdown_super+0x72/0x110
kill_anon_super+0x18/0x30
btrfs_kill_super+0x17/0x30 [btrfs]
deactivate_locked_super+0x3b/0xa0
deactivate_super+0x40/0x50
cleanup_mnt+0x135/0x190
__cleanup_mnt+0x12/0x20
task_work_run+0x64/0xb0
exit_to_user_mode_prepare+0x18a/0x190
syscall_exit_to_user_mode+0x4f/0x270
do_syscall_64+0x45/0x50
entry_SYSCALL_64_after_hwframe+0x44/0xa9
---[ end trace 37b7adca5c1d5c5d ]---
[CAUSE]
Commit
|
|
|
|
c15c2ec07a |
btrfs: check correct variable after allocation in btrfs_backref_iter_alloc
The `if (!ret)` check will always be false and it may result in
ret->path being dereferenced while it is a NULL pointer.
Fixes:
|
|
|
|
faa008899a |
btrfs: make sure SB_I_VERSION doesn't get unset by remount
There's some inconsistency around SB_I_VERSION handling with mount and remount. Since we don't really want it to be off ever just work around this by making sure we don't get the flag cleared on remount. There's a tiny cpu cost of setting the bit, otherwise all changes to i_version also change some of the times (ctime/mtime) so the inode needs to be synced. We wouldn't save anything by disabling it. Reported-by: Eric Sandeen <sandeen@redhat.com> CC: stable@vger.kernel.org # 5.4+ Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> [ add perf impact analysis ] Signed-off-by: David Sterba <dsterba@suse.com> |
|
|
|
4f26433e9b |
btrfs: fix memory leaks after failure to lookup checksums during inode logging
While logging an inode, at copy_items(), if we fail to lookup the checksums
for an extent we release the destination path, free the ins_data array and
then return immediately. However a previous iteration of the for loop may
have added checksums to the ordered_sums list, in which case we leak the
memory used by them.
So fix this by making sure we iterate the ordered_sums list and free all
its checksums before returning.
Fixes:
|
|
|
|
3ef3959b29 |
btrfs: don't show full path of bind mounts in subvol=
Chris Murphy reported a problem where rpm ostree will bind mount a bunch of things for whatever voodoo it's doing. But when it does this /proc/mounts shows something like /dev/sda /mnt/test btrfs rw,relatime,subvolid=256,subvol=/foo 0 0 /dev/sda /mnt/test/baz btrfs rw,relatime,subvolid=256,subvol=/foo/bar 0 0 Despite subvolid=256 being subvol=/foo. This is because we're just spitting out the dentry of the mount point, which in the case of bind mounts is the source path for the mountpoint. Instead we should spit out the path to the actual subvol. Fix this by looking up the name for the subvolid we have mounted. With this fix the same test looks like this /dev/sda /mnt/test btrfs rw,relatime,subvolid=256,subvol=/foo 0 0 /dev/sda /mnt/test/baz btrfs rw,relatime,subvolid=256,subvol=/foo 0 0 Reported-by: Chris Murphy <chris@colorremedies.com> CC: stable@vger.kernel.org # 4.4+ Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
|
|
|
27942c9971 |
btrfs: fix messages after changing compression level by remount
Reported by Forza on IRC that remounting with compression options does not reflect the change in level, or at least it does not appear to do so according to the messages: mount -o compress=zstd:1 /dev/sda /mnt mount -o remount,compress=zstd:15 /mnt does not print the change to the level to syslog: [ 41.366060] BTRFS info (device vda): use zstd compression, level 1 [ 41.368254] BTRFS info (device vda): disk space caching is enabled [ 41.390429] BTRFS info (device vda): disk space caching is enabled What really happens is that the message is lost but the level is actualy changed. There's another weird output, if compression is reset to 'no': [ 45.413776] BTRFS info (device vda): use no compression, level 4 To fix that, save the previous compression level and print the message in that case too and use separate message for 'no' compression. CC: stable@vger.kernel.org # 4.19+ Signed-off-by: David Sterba <dsterba@suse.com> |
|
|
|
bf53d4687b |
btrfs: only search for left_info if there is no right_info in try_merge_free_space
In try_to_merge_free_space we attempt to find entries to the left and
right of the entry we are adding to see if they can be merged. We
search for an entry past our current info (saved into right_info), and
then if right_info exists and it has a rb_prev() we save the rb_prev()
into left_info.
However there's a slight problem in the case that we have a right_info,
but no entry previous to that entry. At that point we will search for
an entry just before the info we're attempting to insert. This will
simply find right_info again, and assign it to left_info, making them
both the same pointer.
Now if right_info _can_ be merged with the range we're inserting, we'll
add it to the info and free right_info. However further down we'll
access left_info, which was right_info, and thus get a use-after-free.
Fix this by only searching for the left entry if we don't find a right
entry at all.
The CVE referenced had a specially crafted file system that could
trigger this use-after-free. However with the tree checker improvements
we no longer trigger the conditions for the UAF. But the original
conditions still apply, hence this fix.
Reference: CVE-2019-19448
Fixes:
|
|
|
|
1e6e238c30 |
btrfs: inode: fix NULL pointer dereference if inode doesn't need compression
[BUG]
There is a bug report of NULL pointer dereference caused in
compress_file_extent():
Oops: Kernel access of bad area, sig: 11 [#1]
LE PAGE_SIZE=64K MMU=Hash SMP NR_CPUS=2048 NUMA pSeries
Workqueue: btrfs-delalloc btrfs_delalloc_helper [btrfs]
NIP [c008000006dd4d34] compress_file_range.constprop.41+0x75c/0x8a0 [btrfs]
LR [c008000006dd4d1c] compress_file_range.constprop.41+0x744/0x8a0 [btrfs]
Call Trace:
[c000000c69093b00] [c008000006dd4d1c] compress_file_range.constprop.41+0x744/0x8a0 [btrfs] (unreliable)
[c000000c69093bd0] [c008000006dd4ebc] async_cow_start+0x44/0xa0 [btrfs]
[c000000c69093c10] [c008000006e14824] normal_work_helper+0xdc/0x598 [btrfs]
[c000000c69093c80] [c0000000001608c0] process_one_work+0x2c0/0x5b0
[c000000c69093d10] [c000000000160c38] worker_thread+0x88/0x660
[c000000c69093db0] [c00000000016b55c] kthread+0x1ac/0x1c0
[c000000c69093e20] [c00000000000b660] ret_from_kernel_thread+0x5c/0x7c
---[ end trace f16954aa20d822f6 ]---
[CAUSE]
For the following execution route of compress_file_range(), it's
possible to hit NULL pointer dereference:
compress_file_extent()
|- pages = NULL;
|- start = async_chunk->start = 0;
|- end = async_chunk = 4095;
|- nr_pages = 1;
|- inode_need_compress() == false; <<< Possible, see later explanation
| Now, we have nr_pages = 1, pages = NULL
|- cont:
|- ret = cow_file_range_inline();
|- if (ret <= 0) {
|- for (i = 0; i < nr_pages; i++) {
|- WARN_ON(pages[i]->mapping); <<< Crash
To enter above call execution branch, we need the following race:
Thread 1 (chattr) | Thread 2 (writeback)
--------------------------+------------------------------
| btrfs_run_delalloc_range
| |- inode_need_compress = true
| |- cow_file_range_async()
btrfs_ioctl_set_flag() |
|- binode_flags |= |
BTRFS_INODE_NOCOMPRESS |
| compress_file_range()
| |- inode_need_compress = false
| |- nr_page = 1 while pages = NULL
| | Then hit the crash
[FIX]
This patch will fix it by checking @pages before doing accessing it.
This patch is only designed as a hot fix and easy to backport.
More elegant fix may make btrfs only check inode_need_compress() once to
avoid such race, but that would be another story.
Reported-by: Luciano Chavez <chavez@us.ibm.com>
Fixes:
|
|
|
|
b79675e15a |
Merge branch 'work.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull misc vfs updates from Al Viro:
"No common topic whatsoever in those, sorry"
* 'work.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
fs: define inode flags using bit numbers
iov_iter: Move unnecessary inclusion of crypto/hash.h
dlmfs: clean up dlmfs_file_{read,write}() a bit
|
|
|
|
8f0cb6660a |
These are the latest RCU bits for v5.9:
- kfree_rcu updates - RCU tasks updates - Read-side scalability tests - SRCU updates - Torture-test updates - Documentation updates - Miscellaneous fixes Signed-off-by: Ingo Molnar <mingo@kernel.org> -----BEGIN PGP SIGNATURE----- iQJFBAABCgAvFiEEBpT5eoXrXCwVQwEKEnMQ0APhK1gFAl8n80ERHG1pbmdvQGtl cm5lbC5vcmcACgkQEnMQ0APhK1gauA/+NtuExW9V9cPDZ8AAp6x6QfoEIgqN4VEk pYuyP0+ZbmwH+h8z7qPqMrwxUHQnhef7gqtlWa7wj9MawbEbmqnA/3uivjX/3Aao bGMMXkqXppc6hgwktgLNk8vfq3LRVEH2P0i0I+Tymgxu3DCHSGRep4LWfdAS/q3z 4pe5JXqdMx+Qnfy/bsVxJTaJAncMq1LQNAtWY1TIwK8L8RmpXrj5dvuLKUr7q+zl P+BfXyrdX+x05TpmHHnI/bR3w9yASL32E0S3IaQYRRqH8TsUIGHWe13Ib6hKXXG5 j7W5KrsOgr0fQBxi+JW2fgGQkrua4o7yk4H2Ygj+Fi5RvP2uqNZdvXFAlP2cUMu/ 7Pg8+7kC6jKIrwpD03s9ZZzm0QN3jsCxFs2PEkkHMzjXbe1CI4tIkTH6ex1uvjR2 v3OhCIp6ypxpEIJbFQucia0iQ4NF+evKjqCvRkbepqQ096jg+CNFh0VG0Tp8XR+y Gk9B9oXvLLPMd6ah5CI9nLJKiMWVRV8mvvqspoblGo//+39ksh4mzxm865tFXYg4 C+DPJvKlY15Ib5eJ/xr8EZ/oS0K2sUF9sMYnK4P8QMhyTBMbpAZiljHYK+Wujt8I g/JCWxrEMv3LHPY9/guB5Nod/Qb4Jqqm9iE9qEX3MQxtt2O2nmmWd91pzFcUXlFU RDBWYJ63Okg= =rNhf -----END PGP SIGNATURE----- Merge tag 'core-rcu-2020-08-03' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull RCU updates from Ingo Molnar: - kfree_rcu updates - RCU tasks updates - Read-side scalability tests - SRCU updates - Torture-test updates - Documentation updates - Miscellaneous fixes * tag 'core-rcu-2020-08-03' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (109 commits) torture: Remove obsolete "cd $KVM" torture: Avoid duplicate specification of qemu command torture: Dump ftrace at shutdown only if requested torture: Add kvm-tranform.sh script for qemu-cmd files torture: Add more tracing crib notes to kvm.sh torture: Improve diagnostic for KCSAN-incapable compilers torture: Correctly summarize build-only runs torture: Pass --kmake-arg to all make invocations rcutorture: Check for unwatched readers torture: Abstract out console-log error detection torture: Add a stop-run capability torture: Create qemu-cmd in --buildonly runs rcu/rcutorture: Replace 0 with false torture: Add --allcpus argument to the kvm.sh script torture: Remove whitespace from identify_qemu_vcpus output rcutorture: NULL rcu_torture_current earlier in cleanup code rcutorture: Handle non-statistic bang-string error messages torture: Set configfile variable to current scenario rcutorture: Add races with task-exit processing locktorture: Use true and false to assign to bool variables ... |
|
|
|
cdc8fcb499 |
for-5.9/io_uring-20200802
-----BEGIN PGP SIGNATURE-----
iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAl8m7asQHGF4Ym9lQGtl
cm5lbC5kawAKCRD301j7KXHgplrCD/0S17kio+k4cOJDGwl88WoJw+QiYmM5019k
decZ1JymQvV1HXRmlcZiEAu0hHDD0FoovSRrw7II3gw3GouETmYQM62f6ZTpDeMD
CED/fidnfULAkPaI6h+bj3jyI0cEuujG/R47rGSQEkIIr3RttqKZUzVkB9KN+KMw
+OBuXZtMIoFFEVJ91qwC2dm2qHLqOn1/5MlT59knso/xbPOYOXsFQpGiACJqF97x
6qSSI8uGE+HZqvL2OLWPDBbLEJhrq+dzCgxln5VlvLele4UcRhOdonUb7nUwEKCe
zwvtXzz16u1D1b8bJL4Kg5bGqyUAQUCSShsfBJJxh6vTTULiHyCX5sQaai1OEB16
4dpBL9E+nOUUix4wo9XBY0/KIYaPWg5L1CoEwkAXqkXPhFvNUucsC0u6KvmzZR3V
1OogVTjl6GhS8uEVQjTKNshkTIC9QHEMXDUOHtINDCb/sLU+ANXU5UpvsuzZ9+kt
KGc4mdyCwaKBq4YW9sVwhhq/RHLD4AUtWZiUVfOE+0cltCLJUNMbQsJ+XrcYaQnm
W4zz22Rep+SJuQNVcCW/w7N2zN3yB6gC1qeroSLvzw4b5el2TdFp+BcgVlLHK+uh
xjsGNCq++fyzNk7vvMZ5hVq4JGXYjza7AiP5HlQ8nqdiPUKUPatWCBqUm9i9Cz/B
n+0dlYbRwQ==
=2vmy
-----END PGP SIGNATURE-----
Merge tag 'for-5.9/io_uring-20200802' of git://git.kernel.dk/linux-block
Pull io_uring updates from Jens Axboe:
"Lots of cleanups in here, hardening the code and/or making it easier
to read and fixing bugs, but a core feature/change too adding support
for real async buffered reads. With the latter in place, we just need
buffered write async support and we're done relying on kthreads for
the fast path. In detail:
- Cleanup how memory accounting is done on ring setup/free (Bijan)
- sq array offset calculation fixup (Dmitry)
- Consistently handle blocking off O_DIRECT submission path (me)
- Support proper async buffered reads, instead of relying on kthread
offload for that. This uses the page waitqueue to drive retries
from task_work, like we handle poll based retry. (me)
- IO completion optimizations (me)
- Fix race with accounting and ring fd install (me)
- Support EPOLLEXCLUSIVE (Jiufei)
- Get rid of the io_kiocb unionizing, made possible by shrinking
other bits (Pavel)
- Completion side cleanups (Pavel)
- Cleanup REQ_F_ flags handling, and kill off many of them (Pavel)
- Request environment grabbing cleanups (Pavel)
- File and socket read/write cleanups (Pavel)
- Improve kiocb_set_rw_flags() (Pavel)
- Tons of fixes and cleanups (Pavel)
- IORING_SQ_NEED_WAKEUP clear fix (Xiaoguang)"
* tag 'for-5.9/io_uring-20200802' of git://git.kernel.dk/linux-block: (127 commits)
io_uring: flip if handling after io_setup_async_rw
fs: optimise kiocb_set_rw_flags()
io_uring: don't touch 'ctx' after installing file descriptor
io_uring: get rid of atomic FAA for cq_timeouts
io_uring: consolidate *_check_overflow accounting
io_uring: fix stalled deferred requests
io_uring: fix racy overflow count reporting
io_uring: deduplicate __io_complete_rw()
io_uring: de-unionise io_kiocb
io-wq: update hash bits
io_uring: fix missing io_queue_linked_timeout()
io_uring: mark ->work uninitialised after cleanup
io_uring: deduplicate io_grab_files() calls
io_uring: don't do opcode prep twice
io_uring: clear IORING_SQ_NEED_WAKEUP after executing task works
io_uring: batch put_task_struct()
tasks: add put_task_struct_many()
io_uring: return locked and pinned page accounting
io_uring: don't miscount pinned memory
io_uring: don't open-code recv kbuf managment
...
|
|
|
|
382625d0d4 |
for-5.9/block-20200802
-----BEGIN PGP SIGNATURE-----
iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAl8m7YwQHGF4Ym9lQGtl
cm5lbC5kawAKCRD301j7KXHgpt+dEAC7a0HYuX2OrkyawBnsgd1QQR/soC7surec
yDDa7SMM8cOq3935bfzcYHV9FWJszEGIknchiGb9R3/T+vmSohbvDsM5zgwya9u/
FHUIuTq324I6JWXKl30k4rwjiX9wQeMt+WZ5gC8KJYCWA296i2IpJwd0A45aaKuS
x4bTjxqknE+fD4gQiMUSt+bmuOUAp81fEku3EPapCRYDPAj8f5uoY7R2arT/POwB
b+s+AtXqzBymIqx1z0sZ/XcdZKmDuhdurGCWu7BfJFIzw5kQ2Qe3W8rUmrQ3pGut
8a21YfilhUFiBv+B4wptfrzJuzU6Ps0BXHCnBsQjzvXwq5uFcZH495mM/4E4OJvh
SbjL2K4iFj+O1ngFkukG/F8tdEM1zKBYy2ZEkGoWKUpyQanbAaGI6QKKJA+DCdBi
yPEb7yRAa5KfLqMiocm1qCEO1I56HRiNHaJVMqCPOZxLmpXj19Fs71yIRplP1Trv
GGXdWZsccjuY6OljoXWdEfnxAr5zBsO3Yf2yFT95AD+egtGsU1oOzlqAaU1mtflw
ABo452pvh6FFpxGXqz6oK4VqY4Et7WgXOiljA4yIGoPpG/08L1Yle4eVc2EE01Jb
+BL49xNJVeUhGFrvUjPGl9kVMeLmubPFbmgrtipW+VRg9W8+Yirw7DPP6K+gbPAR
RzAUdZFbWw==
=abJG
-----END PGP SIGNATURE-----
Merge tag 'for-5.9/block-20200802' of git://git.kernel.dk/linux-block
Pull core block updates from Jens Axboe:
"Good amount of cleanups and tech debt removals in here, and as a
result, the diffstat shows a nice net reduction in code.
- Softirq completion cleanups (Christoph)
- Stop using ->queuedata (Christoph)
- Cleanup bd claiming (Christoph)
- Use check_events, moving away from the legacy media change
(Christoph)
- Use inode i_blkbits consistently (Christoph)
- Remove old unused writeback congestion bits (Christoph)
- Cleanup/unify submission path (Christoph)
- Use bio_uninit consistently, instead of bio_disassociate_blkg
(Christoph)
- sbitmap cleared bits handling (John)
- Request merging blktrace event addition (Jan)
- sysfs add/remove race fixes (Luis)
- blk-mq tag fixes/optimizations (Ming)
- Duplicate words in comments (Randy)
- Flush deferral cleanup (Yufen)
- IO context locking/retry fixes (John)
- struct_size() usage (Gustavo)
- blk-iocost fixes (Chengming)
- blk-cgroup IO stats fixes (Boris)
- Various little fixes"
* tag 'for-5.9/block-20200802' of git://git.kernel.dk/linux-block: (135 commits)
block: blk-timeout: delete duplicated word
block: blk-mq-sched: delete duplicated word
block: blk-mq: delete duplicated word
block: genhd: delete duplicated words
block: elevator: delete duplicated word and fix typos
block: bio: delete duplicated words
block: bfq-iosched: fix duplicated word
iocost_monitor: start from the oldest usage index
iocost: Fix check condition of iocg abs_vdebt
block: Remove callback typedefs for blk_mq_ops
block: Use non _rcu version of list functions for tag_set_list
blk-cgroup: show global disk stats in root cgroup io.stat
blk-cgroup: make iostat functions visible to stat printing
block: improve discard bio alignment in __blkdev_issue_discard()
block: change REQ_OP_ZONE_RESET and REQ_OP_ZONE_RESET_ALL to be odd numbers
block: defer flush request no matter whether we have elevator
block: make blk_timeout_init() static
block: remove retry loop in ioc_release_fn()
block: remove unnecessary ioc nested locking
block: integrate bd_start_claiming into __blkdev_get
...
|
|
|
|
c1cc4784ce |
Merge branch 'for-mingo' of git://git.kernel.org/pub/scm/linux/kernel/git/paulmck/linux-rcu into core/rcu
Pull the v5.9 RCU bits from Paul E. McKenney: - Documentation updates - Miscellaneous fixes - kfree_rcu updates - RCU tasks updates - Read-side scalability tests - SRCU updates - Torture-test updates Signed-off-by: Ingo Molnar <mingo@kernel.org> |
|
|
|
5e548b3201 |
btrfs: do not set the full sync flag on the inode during page release
When removing an extent map at try_release_extent_mapping(), called through
the page release callback (btrfs_releasepage()), we always set the full
sync flag on the inode, which forces the next fsync to use a slower code
path.
This hurts performance for workloads that dirty an amount of data that
exceeds or is very close to the system's RAM memory and do frequent fsync
operations (like database servers can for example). In particular if there
are concurrent fsyncs against different files, by falling back to a full
fsync we do a lot more checksum lookups in the checksums btree, as we do
it for all the extents created in the current transaction, instead of only
the new ones since the last fsync. These checksums lookups not only take
some time but, more importantly, they also cause contention on the
checksums btree locks due to the concurrency with checksum insertions in
the btree by ordered extents from other inodes.
We actually don't need to set the full sync flag on the inode, because we
only remove extent maps that are in the list of modified extents if they
were created in a past transaction, in which case an fsync skips them as
it's pointless to log them. So stop setting the full fsync flag on the
inode whenever we remove an extent map.
This patch is part of a patchset that consists of 3 patches, which have
the following subjects:
1/3 btrfs: fix race between page release and a fast fsync
2/3 btrfs: release old extent maps during page release
3/3 btrfs: do not set the full sync flag on the inode during page release
Performance tests were ran against a branch (misc-next) containing the
whole patchset. The test exercises a workload where there are multiple
processes writing to files and fsyncing them (each writing and fsyncing
its own file), and in total the amount of data dirtied ranges from 2x to
4x the system's RAM memory (16GiB), so that the page release callback is
invoked frequently.
The following script, using fio, was used to perform the tests:
$ cat test-fsync.sh
#!/bin/bash
DEV=/dev/sdk
MNT=/mnt/sdk
MOUNT_OPTIONS="-o ssd"
MKFS_OPTIONS="-d single -m single"
if [ $# -ne 3 ]; then
echo "Use $0 NUM_JOBS FILE_SIZE FSYNC_FREQ"
exit 1
fi
NUM_JOBS=$1
FILE_SIZE=$2
FSYNC_FREQ=$3
cat <<EOF > /tmp/fio-job.ini
[writers]
rw=write
fsync=$FSYNC_FREQ
fallocate=none
group_reporting=1
direct=0
bs=64k
ioengine=sync
size=$FILE_SIZE
directory=$MNT
numjobs=$NUM_JOBS
thread
EOF
echo "Using config:"
echo
cat /tmp/fio-job.ini
echo
mkfs.btrfs -f $MKFS_OPTIONS $DEV &> /dev/null
mount $MOUNT_OPTIONS $DEV $MNT
fio /tmp/fio-job.ini
umount $MNT
The tests were performed for different numbers of jobs, file sizes and
fsync frequency. A qemu VM using kvm was used, with 8 cores (the host has
12 cores, with cpu governance set to performance mode on all cores), 16GiB
of ram (the host has 64GiB) and using a NVMe device directly (without an
intermediary filesystem in the host). While running the tests, the host
was not used for anything else, to avoid disturbing the tests.
The obtained results were the following, and the last line printed by
fio is pasted (includes aggregated throughput and test run time).
*****************************************************
**** 1 job, 32GiB file, fsync frequency 1 ****
*****************************************************
Before patchset:
WRITE: bw=29.1MiB/s (30.5MB/s), 29.1MiB/s-29.1MiB/s (30.5MB/s-30.5MB/s), io=32.0GiB (34.4GB), run=1127557-1127557msec
After patchset:
WRITE: bw=29.3MiB/s (30.7MB/s), 29.3MiB/s-29.3MiB/s (30.7MB/s-30.7MB/s), io=32.0GiB (34.4GB), run=1119042-1119042msec
(+0.7% throughput, -0.8% run time)
*****************************************************
**** 2 jobs, 16GiB files, fsync frequency 1 ****
*****************************************************
Before patchset:
WRITE: bw=33.5MiB/s (35.1MB/s), 33.5MiB/s-33.5MiB/s (35.1MB/s-35.1MB/s), io=32.0GiB (34.4GB), run=979000-979000msec
After patchset:
WRITE: bw=39.9MiB/s (41.8MB/s), 39.9MiB/s-39.9MiB/s (41.8MB/s-41.8MB/s), io=32.0GiB (34.4GB), run=821283-821283msec
(+19.1% throughput, -16.1% runtime)
*****************************************************
**** 4 jobs, 8GiB files, fsync frequency 1 ****
*****************************************************
Before patchset:
WRITE: bw=52.1MiB/s (54.6MB/s), 52.1MiB/s-52.1MiB/s (54.6MB/s-54.6MB/s), io=32.0GiB (34.4GB), run=629130-629130msec
After patchset:
WRITE: bw=71.8MiB/s (75.3MB/s), 71.8MiB/s-71.8MiB/s (75.3MB/s-75.3MB/s), io=32.0GiB (34.4GB), run=456357-456357msec
(+37.8% throughput, -27.5% runtime)
*****************************************************
**** 8 jobs, 4GiB files, fsync frequency 1 ****
*****************************************************
Before patchset:
WRITE: bw=76.1MiB/s (79.8MB/s), 76.1MiB/s-76.1MiB/s (79.8MB/s-79.8MB/s), io=32.0GiB (34.4GB), run=430708-430708msec
After patchset:
WRITE: bw=133MiB/s (140MB/s), 133MiB/s-133MiB/s (140MB/s-140MB/s), io=32.0GiB (34.4GB), run=245458-245458msec
(+74.7% throughput, -43.0% run time)
*****************************************************
**** 16 jobs, 2GiB files, fsync frequency 1 ****
*****************************************************
Before patchset:
WRITE: bw=74.7MiB/s (78.3MB/s), 74.7MiB/s-74.7MiB/s (78.3MB/s-78.3MB/s), io=32.0GiB (34.4GB), run=438625-438625msec
After patchset:
WRITE: bw=184MiB/s (193MB/s), 184MiB/s-184MiB/s (193MB/s-193MB/s), io=32.0GiB (34.4GB), run=177864-177864msec
(+146.3% throughput, -59.5% run time)
*****************************************************
**** 32 jobs, 2GiB files, fsync frequency 1 ****
*****************************************************
Before patchset:
WRITE: bw=72.6MiB/s (76.1MB/s), 72.6MiB/s-72.6MiB/s (76.1MB/s-76.1MB/s), io=64.0GiB (68.7GB), run=902615-902615msec
After patchset:
WRITE: bw=227MiB/s (238MB/s), 227MiB/s-227MiB/s (238MB/s-238MB/s), io=64.0GiB (68.7GB), run=288936-288936msec
(+212.7% throughput, -68.0% run time)
*****************************************************
**** 64 jobs, 1GiB files, fsync frequency 1 ****
*****************************************************
Before patchset:
WRITE: bw=98.8MiB/s (104MB/s), 98.8MiB/s-98.8MiB/s (104MB/s-104MB/s), io=64.0GiB (68.7GB), run=663126-663126msec
After patchset:
WRITE: bw=294MiB/s (308MB/s), 294MiB/s-294MiB/s (308MB/s-308MB/s), io=64.0GiB (68.7GB), run=222940-222940msec
(+197.6% throughput, -66.4% run time)
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
|
|
fbc2bd7e7a |
btrfs: release old extent maps during page release
When removing an extent map at try_release_extent_mapping(), called through the page release callback (btrfs_releasepage()), we never release an extent map that is in the list of modified extents. This is to prevent races with a concurrent fsync using the fast path, which could lead to not logging an extent created in the current transaction. However we can safely remove an extent map created in a past transaction that is still in the list of modified extents (because no one fsynced yet the inode after that transaction got commited), because such extents are skipped during an fsync as it is pointless to log them. This change does that. Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
|
|
|
3d6448e631 |
btrfs: fix race between page release and a fast fsync
When releasing an extent map, done through the page release callback, we
can race with an ongoing fast fsync and cause the fsync to miss a new
extent and not log it. The steps for this to happen are the following:
1) A page is dirtied for some inode I;
2) Writeback for that page is triggered by a path other than fsync, for
example by the system due to memory pressure;
3) When the ordered extent for the extent (a single 4K page) finishes,
we unpin the corresponding extent map and set its generation to N,
the current transaction's generation;
4) The btrfs_releasepage() callback is invoked by the system due to
memory pressure for that no longer dirty page of inode I;
5) At the same time, some task calls fsync on inode I, joins transaction
N, and at btrfs_log_inode() it sees that the inode does not have the
full sync flag set, so we proceed with a fast fsync. But before we get
into btrfs_log_changed_extents() and lock the inode's extent map tree:
6) Through btrfs_releasepage() we end up at try_release_extent_mapping()
and we remove the extent map for the new 4Kb extent, because it is
neither pinned anymore nor locked. By calling remove_extent_mapping(),
we remove the extent map from the list of modified extents, since the
extent map does not have the logging flag set. We unlock the inode's
extent map tree;
7) The task doing the fast fsync now enters btrfs_log_changed_extents(),
locks the inode's extent map tree and iterates its list of modified
extents, which no longer has the 4Kb extent in it, so it does not log
the extent;
8) The fsync finishes;
9) Before transaction N is committed, a power failure happens. After
replaying the log, the 4K extent of inode I will be missing, since
it was not logged due to the race with try_release_extent_mapping().
So fix this by teaching try_release_extent_mapping() to not remove an
extent map if it's still in the list of modified extents.
Fixes:
|
|
|
|
88c4703f00 |
btrfs: open-code remount flag setting in btrfs_remount
When we're (re)mounting a btrfs filesystem we set the BTRFS_FS_STATE_REMOUNTING state in fs_info to serialize against async reclaim or defrags. This flag is set in btrfs_remount_prepare() called by btrfs_remount(). As btrfs_remount_prepare() does nothing but setting this flag and doesn't have a second caller, we can just open-code the flag setting in btrfs_remount(). Similarly do for so clearing of the flag by moving it out of btrfs_remount_cleanup() into btrfs_remount() to be symmetrical. Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
|
|
|
162e0a16b7 |
btrfs: if we're restriping, use the target restripe profile
Previously we depended on some weird behavior in our chunk allocator to force the allocation of new stripes, so by the time we got to doing the reduce we would usually already have a chunk with the proper target. However that behavior causes other problems and needs to be removed. First however we need to remove this check to only restripe if we already have those available profiles, because if we're allocating our first chunk it obviously will not be available. Simply use the target as specified, and if that fails it'll be because we're out of space. Tested-by: Holger Hoffstätte <holger@applied-asynchrony.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: David Sterba <dsterba@suse.com> |
|
|
|
349e120ece |
btrfs: don't adjust bg flags and use default allocation profiles
btrfs/061 has been failing consistently for me recently with a
transaction abort. We run out of space in the system chunk array, which
means we've allocated way too many system chunks than we need.
Chris added this a long time ago for balance as a poor mans restriping.
If you had a single disk and then added another disk and then did a
balance, update_block_group_flags would then figure out which RAID level
you needed.
Fast forward to today and we have restriping behavior, so we can
explicitly tell the fs that we're trying to change the raid level. This
is accomplished through the normal get_alloc_profile path.
Furthermore this code actually causes btrfs/061 to fail, because we do
things like mkfs -m dup -d single with multiple devices. This trips
this check
alloc_flags = update_block_group_flags(fs_info, cache->flags);
if (alloc_flags != cache->flags) {
ret = btrfs_chunk_alloc(trans, alloc_flags, CHUNK_ALLOC_FORCE);
in btrfs_inc_block_group_ro. Because we're balancing and scrubbing, but
not actually restriping, we keep forcing chunk allocation of RAID1
chunks. This eventually causes us to run out of system space and the
file system aborts and flips read only.
We don't need this poor mans restriping any more, simply use the normal
get_alloc_profile helper, which will get the correct alloc_flags and
thus make the right decision for chunk allocation. This keeps us from
allocating a billion system chunks and falling over.
Tested-by: Holger Hoffstätte <holger@applied-asynchrony.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
|
|
ab0db043c3 |
btrfs: fix lockdep splat from btrfs_dump_space_info
When running with -o enospc_debug you can get the following splat if one of the dump_space_info's trip ====================================================== WARNING: possible circular locking dependency detected 5.8.0-rc5+ #20 Tainted: G OE ------------------------------------------------------ dd/563090 is trying to acquire lock: ffff9e7dbf4f1e18 (&ctl->tree_lock){+.+.}-{2:2}, at: btrfs_dump_free_space+0x2b/0xa0 [btrfs] but task is already holding lock: ffff9e7e2284d428 (&cache->lock){+.+.}-{2:2}, at: btrfs_dump_space_info+0xaa/0x120 [btrfs] which lock already depends on the new lock. the existing dependency chain (in reverse order) is: -> #3 (&cache->lock){+.+.}-{2:2}: _raw_spin_lock+0x25/0x30 btrfs_add_reserved_bytes+0x3c/0x3c0 [btrfs] find_free_extent+0x7ef/0x13b0 [btrfs] btrfs_reserve_extent+0x9b/0x180 [btrfs] btrfs_alloc_tree_block+0xc1/0x340 [btrfs] alloc_tree_block_no_bg_flush+0x4a/0x60 [btrfs] __btrfs_cow_block+0x122/0x530 [btrfs] btrfs_cow_block+0x106/0x210 [btrfs] commit_cowonly_roots+0x55/0x300 [btrfs] btrfs_commit_transaction+0x4ed/0xac0 [btrfs] sync_filesystem+0x74/0x90 generic_shutdown_super+0x22/0x100 kill_anon_super+0x14/0x30 btrfs_kill_super+0x12/0x20 [btrfs] deactivate_locked_super+0x36/0x70 cleanup_mnt+0x104/0x160 task_work_run+0x5f/0x90 __prepare_exit_to_usermode+0x1bd/0x1c0 do_syscall_64+0x5e/0xb0 entry_SYSCALL_64_after_hwframe+0x44/0xa9 -> #2 (&space_info->lock){+.+.}-{2:2}: _raw_spin_lock+0x25/0x30 btrfs_block_rsv_release+0x1a6/0x3f0 [btrfs] btrfs_inode_rsv_release+0x4f/0x170 [btrfs] btrfs_clear_delalloc_extent+0x155/0x480 [btrfs] clear_state_bit+0x81/0x1a0 [btrfs] __clear_extent_bit+0x25c/0x5d0 [btrfs] clear_extent_bit+0x15/0x20 [btrfs] btrfs_invalidatepage+0x2b7/0x3c0 [btrfs] truncate_cleanup_page+0x47/0xe0 truncate_inode_pages_range+0x238/0x840 truncate_pagecache+0x44/0x60 btrfs_setattr+0x202/0x5e0 [btrfs] notify_change+0x33b/0x490 do_truncate+0x76/0xd0 path_openat+0x687/0xa10 do_filp_open+0x91/0x100 do_sys_openat2+0x215/0x2d0 do_sys_open+0x44/0x80 do_syscall_64+0x52/0xb0 entry_SYSCALL_64_after_hwframe+0x44/0xa9 -> #1 (&tree->lock#2){+.+.}-{2:2}: _raw_spin_lock+0x25/0x30 find_first_extent_bit+0x32/0x150 [btrfs] write_pinned_extent_entries.isra.0+0xc5/0x100 [btrfs] __btrfs_write_out_cache+0x172/0x480 [btrfs] btrfs_write_out_cache+0x7a/0xf0 [btrfs] btrfs_write_dirty_block_groups+0x286/0x3b0 [btrfs] commit_cowonly_roots+0x245/0x300 [btrfs] btrfs_commit_transaction+0x4ed/0xac0 [btrfs] close_ctree+0xf9/0x2f5 [btrfs] generic_shutdown_super+0x6c/0x100 kill_anon_super+0x14/0x30 btrfs_kill_super+0x12/0x20 [btrfs] deactivate_locked_super+0x36/0x70 cleanup_mnt+0x104/0x160 task_work_run+0x5f/0x90 __prepare_exit_to_usermode+0x1bd/0x1c0 do_syscall_64+0x5e/0xb0 entry_SYSCALL_64_after_hwframe+0x44/0xa9 -> #0 (&ctl->tree_lock){+.+.}-{2:2}: __lock_acquire+0x1240/0x2460 lock_acquire+0xab/0x360 _raw_spin_lock+0x25/0x30 btrfs_dump_free_space+0x2b/0xa0 [btrfs] btrfs_dump_space_info+0xf4/0x120 [btrfs] btrfs_reserve_extent+0x176/0x180 [btrfs] __btrfs_prealloc_file_range+0x145/0x550 [btrfs] cache_save_setup+0x28d/0x3b0 [btrfs] btrfs_start_dirty_block_groups+0x1fc/0x4f0 [btrfs] btrfs_commit_transaction+0xcc/0xac0 [btrfs] btrfs_alloc_data_chunk_ondemand+0x162/0x4c0 [btrfs] btrfs_check_data_free_space+0x4c/0xa0 [btrfs] btrfs_buffered_write.isra.0+0x19b/0x740 [btrfs] btrfs_file_write_iter+0x3cf/0x610 [btrfs] new_sync_write+0x11e/0x1b0 vfs_write+0x1c9/0x200 ksys_write+0x68/0xe0 do_syscall_64+0x52/0xb0 entry_SYSCALL_64_after_hwframe+0x44/0xa9 other info that might help us debug this: Chain exists of: &ctl->tree_lock --> &space_info->lock --> &cache->lock Possible unsafe locking scenario: CPU0 CPU1 ---- ---- lock(&cache->lock); lock(&space_info->lock); lock(&cache->lock); lock(&ctl->tree_lock); *** DEADLOCK *** 6 locks held by dd/563090: #0: ffff9e7e21d18448 (sb_writers#14){.+.+}-{0:0}, at: vfs_write+0x195/0x200 #1: ffff9e7dd0410ed8 (&sb->s_type->i_mutex_key#19){++++}-{3:3}, at: btrfs_file_write_iter+0x86/0x610 [btrfs] #2: ffff9e7e21d18638 (sb_internal#2){.+.+}-{0:0}, at: start_transaction+0x40b/0x5b0 [btrfs] #3: ffff9e7e1f05d688 (&cur_trans->cache_write_mutex){+.+.}-{3:3}, at: btrfs_start_dirty_block_groups+0x158/0x4f0 [btrfs] #4: ffff9e7e2284ddb8 (&space_info->groups_sem){++++}-{3:3}, at: btrfs_dump_space_info+0x69/0x120 [btrfs] #5: ffff9e7e2284d428 (&cache->lock){+.+.}-{2:2}, at: btrfs_dump_space_info+0xaa/0x120 [btrfs] stack backtrace: CPU: 3 PID: 563090 Comm: dd Tainted: G OE 5.8.0-rc5+ #20 Hardware name: To Be Filled By O.E.M. To Be Filled By O.E.M./890FX Deluxe5, BIOS P1.40 05/03/2011 Call Trace: dump_stack+0x96/0xd0 check_noncircular+0x162/0x180 __lock_acquire+0x1240/0x2460 ? wake_up_klogd.part.0+0x30/0x40 lock_acquire+0xab/0x360 ? btrfs_dump_free_space+0x2b/0xa0 [btrfs] _raw_spin_lock+0x25/0x30 ? btrfs_dump_free_space+0x2b/0xa0 [btrfs] btrfs_dump_free_space+0x2b/0xa0 [btrfs] btrfs_dump_space_info+0xf4/0x120 [btrfs] btrfs_reserve_extent+0x176/0x180 [btrfs] __btrfs_prealloc_file_range+0x145/0x550 [btrfs] ? btrfs_qgroup_reserve_data+0x1d/0x60 [btrfs] cache_save_setup+0x28d/0x3b0 [btrfs] btrfs_start_dirty_block_groups+0x1fc/0x4f0 [btrfs] btrfs_commit_transaction+0xcc/0xac0 [btrfs] ? start_transaction+0xe0/0x5b0 [btrfs] btrfs_alloc_data_chunk_ondemand+0x162/0x4c0 [btrfs] btrfs_check_data_free_space+0x4c/0xa0 [btrfs] btrfs_buffered_write.isra.0+0x19b/0x740 [btrfs] ? ktime_get_coarse_real_ts64+0xa8/0xd0 ? trace_hardirqs_on+0x1c/0xe0 btrfs_file_write_iter+0x3cf/0x610 [btrfs] new_sync_write+0x11e/0x1b0 vfs_write+0x1c9/0x200 ksys_write+0x68/0xe0 do_syscall_64+0x52/0xb0 entry_SYSCALL_64_after_hwframe+0x44/0xa9 This is because we're holding the block_group->lock while trying to dump the free space cache. However we don't need this lock, we just need it to read the values for the printk, so move the free space cache dumping outside of the block group lock. Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
|
|
|
01d01caf19 |
btrfs: move the chunk_mutex in btrfs_read_chunk_tree
We are currently getting this lockdep splat in btrfs/161: ====================================================== WARNING: possible circular locking dependency detected 5.8.0-rc5+ #20 Tainted: G E ------------------------------------------------------ mount/678048 is trying to acquire lock: ffff9b769f15b6e0 (&fs_devs->device_list_mutex){+.+.}-{3:3}, at: clone_fs_devices+0x4d/0x170 [btrfs] but task is already holding lock: ffff9b76abdb08d0 (&fs_info->chunk_mutex){+.+.}-{3:3}, at: btrfs_read_chunk_tree+0x6a/0x800 [btrfs] which lock already depends on the new lock. the existing dependency chain (in reverse order) is: -> #1 (&fs_info->chunk_mutex){+.+.}-{3:3}: __mutex_lock+0x8b/0x8f0 btrfs_init_new_device+0x2d2/0x1240 [btrfs] btrfs_ioctl+0x1de/0x2d20 [btrfs] ksys_ioctl+0x87/0xc0 __x64_sys_ioctl+0x16/0x20 do_syscall_64+0x52/0xb0 entry_SYSCALL_64_after_hwframe+0x44/0xa9 -> #0 (&fs_devs->device_list_mutex){+.+.}-{3:3}: __lock_acquire+0x1240/0x2460 lock_acquire+0xab/0x360 __mutex_lock+0x8b/0x8f0 clone_fs_devices+0x4d/0x170 [btrfs] btrfs_read_chunk_tree+0x330/0x800 [btrfs] open_ctree+0xb7c/0x18ce [btrfs] btrfs_mount_root.cold+0x13/0xfa [btrfs] legacy_get_tree+0x30/0x50 vfs_get_tree+0x28/0xc0 fc_mount+0xe/0x40 vfs_kern_mount.part.0+0x71/0x90 btrfs_mount+0x13b/0x3e0 [btrfs] legacy_get_tree+0x30/0x50 vfs_get_tree+0x28/0xc0 do_mount+0x7de/0xb30 __x64_sys_mount+0x8e/0xd0 do_syscall_64+0x52/0xb0 entry_SYSCALL_64_after_hwframe+0x44/0xa9 other info that might help us debug this: Possible unsafe locking scenario: CPU0 CPU1 ---- ---- lock(&fs_info->chunk_mutex); lock(&fs_devs->device_list_mutex); lock(&fs_info->chunk_mutex); lock(&fs_devs->device_list_mutex); *** DEADLOCK *** 3 locks held by mount/678048: #0: ffff9b75ff5fb0e0 (&type->s_umount_key#63/1){+.+.}-{3:3}, at: alloc_super+0xb5/0x380 #1: ffffffffc0c2fbc8 (uuid_mutex){+.+.}-{3:3}, at: btrfs_read_chunk_tree+0x54/0x800 [btrfs] #2: ffff9b76abdb08d0 (&fs_info->chunk_mutex){+.+.}-{3:3}, at: btrfs_read_chunk_tree+0x6a/0x800 [btrfs] stack backtrace: CPU: 2 PID: 678048 Comm: mount Tainted: G E 5.8.0-rc5+ #20 Hardware name: To Be Filled By O.E.M. To Be Filled By O.E.M./890FX Deluxe5, BIOS P1.40 05/03/2011 Call Trace: dump_stack+0x96/0xd0 check_noncircular+0x162/0x180 __lock_acquire+0x1240/0x2460 ? asm_sysvec_apic_timer_interrupt+0x12/0x20 lock_acquire+0xab/0x360 ? clone_fs_devices+0x4d/0x170 [btrfs] __mutex_lock+0x8b/0x8f0 ? clone_fs_devices+0x4d/0x170 [btrfs] ? rcu_read_lock_sched_held+0x52/0x60 ? cpumask_next+0x16/0x20 ? module_assert_mutex_or_preempt+0x14/0x40 ? __module_address+0x28/0xf0 ? clone_fs_devices+0x4d/0x170 [btrfs] ? static_obj+0x4f/0x60 ? lockdep_init_map_waits+0x43/0x200 ? clone_fs_devices+0x4d/0x170 [btrfs] clone_fs_devices+0x4d/0x170 [btrfs] btrfs_read_chunk_tree+0x330/0x800 [btrfs] open_ctree+0xb7c/0x18ce [btrfs] ? super_setup_bdi_name+0x79/0xd0 btrfs_mount_root.cold+0x13/0xfa [btrfs] ? vfs_parse_fs_string+0x84/0xb0 ? rcu_read_lock_sched_held+0x52/0x60 ? kfree+0x2b5/0x310 legacy_get_tree+0x30/0x50 vfs_get_tree+0x28/0xc0 fc_mount+0xe/0x40 vfs_kern_mount.part.0+0x71/0x90 btrfs_mount+0x13b/0x3e0 [btrfs] ? cred_has_capability+0x7c/0x120 ? rcu_read_lock_sched_held+0x52/0x60 ? legacy_get_tree+0x30/0x50 legacy_get_tree+0x30/0x50 vfs_get_tree+0x28/0xc0 do_mount+0x7de/0xb30 ? memdup_user+0x4e/0x90 __x64_sys_mount+0x8e/0xd0 do_syscall_64+0x52/0xb0 entry_SYSCALL_64_after_hwframe+0x44/0xa9 This is because btrfs_read_chunk_tree() can come upon DEV_EXTENT's and then read the device, which takes the device_list_mutex. The device_list_mutex needs to be taken before the chunk_mutex, so this is a problem. We only really need the chunk mutex around adding the chunk, so move the mutex around read_one_chunk. An argument could be made that we don't even need the chunk_mutex here as it's during mount, and we are protected by various other locks. However we already have special rules for ->device_list_mutex, and I'd rather not have another special case for ->chunk_mutex. CC: stable@vger.kernel.org # 4.19+ Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
|
|
|
18c850fdc5 |
btrfs: open device without device_list_mutex
There's long existed a lockdep splat because we open our bdev's under the ->device_list_mutex at mount time, which acquires the bd_mutex. Usually this goes unnoticed, but if you do loopback devices at all suddenly the bd_mutex comes with a whole host of other dependencies, which results in the splat when you mount a btrfs file system. ====================================================== WARNING: possible circular locking dependency detected 5.8.0-0.rc3.1.fc33.x86_64+debug #1 Not tainted ------------------------------------------------------ systemd-journal/509 is trying to acquire lock: ffff970831f84db0 (&fs_info->reloc_mutex){+.+.}-{3:3}, at: btrfs_record_root_in_trans+0x44/0x70 [btrfs] but task is already holding lock: ffff97083144d598 (sb_pagefaults){.+.+}-{0:0}, at: btrfs_page_mkwrite+0x59/0x560 [btrfs] which lock already depends on the new lock. the existing dependency chain (in reverse order) is: -> #6 (sb_pagefaults){.+.+}-{0:0}: __sb_start_write+0x13e/0x220 btrfs_page_mkwrite+0x59/0x560 [btrfs] do_page_mkwrite+0x4f/0x130 do_wp_page+0x3b0/0x4f0 handle_mm_fault+0xf47/0x1850 do_user_addr_fault+0x1fc/0x4b0 exc_page_fault+0x88/0x300 asm_exc_page_fault+0x1e/0x30 -> #5 (&mm->mmap_lock#2){++++}-{3:3}: __might_fault+0x60/0x80 _copy_from_user+0x20/0xb0 get_sg_io_hdr+0x9a/0xb0 scsi_cmd_ioctl+0x1ea/0x2f0 cdrom_ioctl+0x3c/0x12b4 sr_block_ioctl+0xa4/0xd0 block_ioctl+0x3f/0x50 ksys_ioctl+0x82/0xc0 __x64_sys_ioctl+0x16/0x20 do_syscall_64+0x52/0xb0 entry_SYSCALL_64_after_hwframe+0x44/0xa9 -> #4 (&cd->lock){+.+.}-{3:3}: __mutex_lock+0x7b/0x820 sr_block_open+0xa2/0x180 __blkdev_get+0xdd/0x550 blkdev_get+0x38/0x150 do_dentry_open+0x16b/0x3e0 path_openat+0x3c9/0xa00 do_filp_open+0x75/0x100 do_sys_openat2+0x8a/0x140 __x64_sys_openat+0x46/0x70 do_syscall_64+0x52/0xb0 entry_SYSCALL_64_after_hwframe+0x44/0xa9 -> #3 (&bdev->bd_mutex){+.+.}-{3:3}: __mutex_lock+0x7b/0x820 __blkdev_get+0x6a/0x550 blkdev_get+0x85/0x150 blkdev_get_by_path+0x2c/0x70 btrfs_get_bdev_and_sb+0x1b/0xb0 [btrfs] open_fs_devices+0x88/0x240 [btrfs] btrfs_open_devices+0x92/0xa0 [btrfs] btrfs_mount_root+0x250/0x490 [btrfs] legacy_get_tree+0x30/0x50 vfs_get_tree+0x28/0xc0 vfs_kern_mount.part.0+0x71/0xb0 btrfs_mount+0x119/0x380 [btrfs] legacy_get_tree+0x30/0x50 vfs_get_tree+0x28/0xc0 do_mount+0x8c6/0xca0 __x64_sys_mount+0x8e/0xd0 do_syscall_64+0x52/0xb0 entry_SYSCALL_64_after_hwframe+0x44/0xa9 -> #2 (&fs_devs->device_list_mutex){+.+.}-{3:3}: __mutex_lock+0x7b/0x820 btrfs_run_dev_stats+0x36/0x420 [btrfs] commit_cowonly_roots+0x91/0x2d0 [btrfs] btrfs_commit_transaction+0x4e6/0x9f0 [btrfs] btrfs_sync_file+0x38a/0x480 [btrfs] __x64_sys_fdatasync+0x47/0x80 do_syscall_64+0x52/0xb0 entry_SYSCALL_64_after_hwframe+0x44/0xa9 -> #1 (&fs_info->tree_log_mutex){+.+.}-{3:3}: __mutex_lock+0x7b/0x820 btrfs_commit_transaction+0x48e/0x9f0 [btrfs] btrfs_sync_file+0x38a/0x480 [btrfs] __x64_sys_fdatasync+0x47/0x80 do_syscall_64+0x52/0xb0 entry_SYSCALL_64_after_hwframe+0x44/0xa9 -> #0 (&fs_info->reloc_mutex){+.+.}-{3:3}: __lock_acquire+0x1241/0x20c0 lock_acquire+0xb0/0x400 __mutex_lock+0x7b/0x820 btrfs_record_root_in_trans+0x44/0x70 [btrfs] start_transaction+0xd2/0x500 [btrfs] btrfs_dirty_inode+0x44/0xd0 [btrfs] file_update_time+0xc6/0x120 btrfs_page_mkwrite+0xda/0x560 [btrfs] do_page_mkwrite+0x4f/0x130 do_wp_page+0x3b0/0x4f0 handle_mm_fault+0xf47/0x1850 do_user_addr_fault+0x1fc/0x4b0 exc_page_fault+0x88/0x300 asm_exc_page_fault+0x1e/0x30 other info that might help us debug this: Chain exists of: &fs_info->reloc_mutex --> &mm->mmap_lock#2 --> sb_pagefaults Possible unsafe locking scenario: CPU0 CPU1 ---- ---- lock(sb_pagefaults); lock(&mm->mmap_lock#2); lock(sb_pagefaults); lock(&fs_info->reloc_mutex); *** DEADLOCK *** 3 locks held by systemd-journal/509: #0: ffff97083bdec8b8 (&mm->mmap_lock#2){++++}-{3:3}, at: do_user_addr_fault+0x12e/0x4b0 #1: ffff97083144d598 (sb_pagefaults){.+.+}-{0:0}, at: btrfs_page_mkwrite+0x59/0x560 [btrfs] #2: ffff97083144d6a8 (sb_internal){.+.+}-{0:0}, at: start_transaction+0x3f8/0x500 [btrfs] stack backtrace: CPU: 0 PID: 509 Comm: systemd-journal Not tainted 5.8.0-0.rc3.1.fc33.x86_64+debug #1 Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 0.0.0 02/06/2015 Call Trace: dump_stack+0x92/0xc8 check_noncircular+0x134/0x150 __lock_acquire+0x1241/0x20c0 lock_acquire+0xb0/0x400 ? btrfs_record_root_in_trans+0x44/0x70 [btrfs] ? lock_acquire+0xb0/0x400 ? btrfs_record_root_in_trans+0x44/0x70 [btrfs] __mutex_lock+0x7b/0x820 ? btrfs_record_root_in_trans+0x44/0x70 [btrfs] ? kvm_sched_clock_read+0x14/0x30 ? sched_clock+0x5/0x10 ? sched_clock_cpu+0xc/0xb0 btrfs_record_root_in_trans+0x44/0x70 [btrfs] start_transaction+0xd2/0x500 [btrfs] btrfs_dirty_inode+0x44/0xd0 [btrfs] file_update_time+0xc6/0x120 btrfs_page_mkwrite+0xda/0x560 [btrfs] ? sched_clock+0x5/0x10 do_page_mkwrite+0x4f/0x130 do_wp_page+0x3b0/0x4f0 handle_mm_fault+0xf47/0x1850 do_user_addr_fault+0x1fc/0x4b0 exc_page_fault+0x88/0x300 ? asm_exc_page_fault+0x8/0x30 asm_exc_page_fault+0x1e/0x30 RIP: 0033:0x7fa3972fdbfe Code: Bad RIP value. Fix this by not holding the ->device_list_mutex at this point. The device_list_mutex exists to protect us from modifying the device list while the file system is running. However it can also be modified by doing a scan on a device. But this action is specifically protected by the uuid_mutex, which we are holding here. We cannot race with opening at this point because we have the ->s_mount lock held during the mount. Not having the ->device_list_mutex here is perfectly safe as we're not going to change the devices at this point. CC: stable@vger.kernel.org # 4.19+ Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> [ add some comments ] Signed-off-by: David Sterba <dsterba@suse.com> |
|
|
|
a47bd78d0c |
btrfs: sysfs: use NOFS for device creation
Dave hit this splat during testing btrfs/078: ====================================================== WARNING: possible circular locking dependency detected 5.8.0-rc6-default+ #1191 Not tainted ------------------------------------------------------ kswapd0/75 is trying to acquire lock: ffffa040e9d04ff8 (&delayed_node->mutex){+.+.}-{3:3}, at: __btrfs_release_delayed_node.part.0+0x3f/0x310 [btrfs] but task is already holding lock: ffffffff8b0c8040 (fs_reclaim){+.+.}-{0:0}, at: __fs_reclaim_acquire+0x5/0x30 which lock already depends on the new lock. the existing dependency chain (in reverse order) is: -> #2 (fs_reclaim){+.+.}-{0:0}: __lock_acquire+0x56f/0xaa0 lock_acquire+0xa3/0x440 fs_reclaim_acquire.part.0+0x25/0x30 __kmalloc_track_caller+0x49/0x330 kstrdup+0x2e/0x60 __kernfs_new_node.constprop.0+0x44/0x250 kernfs_new_node+0x25/0x50 kernfs_create_link+0x34/0xa0 sysfs_do_create_link_sd+0x5e/0xd0 btrfs_sysfs_add_devices_dir+0x65/0x100 [btrfs] btrfs_init_new_device+0x44c/0x12b0 [btrfs] btrfs_ioctl+0xc3c/0x25c0 [btrfs] ksys_ioctl+0x68/0xa0 __x64_sys_ioctl+0x16/0x20 do_syscall_64+0x50/0xe0 entry_SYSCALL_64_after_hwframe+0x44/0xa9 -> #1 (&fs_info->chunk_mutex){+.+.}-{3:3}: __lock_acquire+0x56f/0xaa0 lock_acquire+0xa3/0x440 __mutex_lock+0xa0/0xaf0 btrfs_chunk_alloc+0x137/0x3e0 [btrfs] find_free_extent+0xb44/0xfb0 [btrfs] btrfs_reserve_extent+0x9b/0x180 [btrfs] btrfs_alloc_tree_block+0xc1/0x350 [btrfs] alloc_tree_block_no_bg_flush+0x4a/0x60 [btrfs] __btrfs_cow_block+0x143/0x7a0 [btrfs] btrfs_cow_block+0x15f/0x310 [btrfs] push_leaf_right+0x150/0x240 [btrfs] split_leaf+0x3cd/0x6d0 [btrfs] btrfs_search_slot+0xd14/0xf70 [btrfs] btrfs_insert_empty_items+0x64/0xc0 [btrfs] __btrfs_commit_inode_delayed_items+0xb2/0x840 [btrfs] btrfs_async_run_delayed_root+0x10e/0x1d0 [btrfs] btrfs_work_helper+0x2f9/0x650 [btrfs] process_one_work+0x22c/0x600 worker_thread+0x50/0x3b0 kthread+0x137/0x150 ret_from_fork+0x1f/0x30 -> #0 (&delayed_node->mutex){+.+.}-{3:3}: check_prev_add+0x98/0xa20 validate_chain+0xa8c/0x2a00 __lock_acquire+0x56f/0xaa0 lock_acquire+0xa3/0x440 __mutex_lock+0xa0/0xaf0 __btrfs_release_delayed_node.part.0+0x3f/0x310 [btrfs] btrfs_evict_inode+0x3bf/0x560 [btrfs] evict+0xd6/0x1c0 dispose_list+0x48/0x70 prune_icache_sb+0x54/0x80 super_cache_scan+0x121/0x1a0 do_shrink_slab+0x175/0x420 shrink_slab+0xb1/0x2e0 shrink_node+0x192/0x600 balance_pgdat+0x31f/0x750 kswapd+0x206/0x510 kthread+0x137/0x150 ret_from_fork+0x1f/0x30 other info that might help us debug this: Chain exists of: &delayed_node->mutex --> &fs_info->chunk_mutex --> fs_reclaim Possible unsafe locking scenario: CPU0 CPU1 ---- ---- lock(fs_reclaim); lock(&fs_info->chunk_mutex); lock(fs_reclaim); lock(&delayed_node->mutex); *** DEADLOCK *** 3 locks held by kswapd0/75: #0: ffffffff8b0c8040 (fs_reclaim){+.+.}-{0:0}, at: __fs_reclaim_acquire+0x5/0x30 #1: ffffffff8b0b50b8 (shrinker_rwsem){++++}-{3:3}, at: shrink_slab+0x54/0x2e0 #2: ffffa040e057c0e8 (&type->s_umount_key#26){++++}-{3:3}, at: trylock_super+0x16/0x50 stack backtrace: CPU: 2 PID: 75 Comm: kswapd0 Not tainted 5.8.0-rc6-default+ #1191 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.12.0-59-gc9ba527-rebuilt.opensuse.org 04/01/2014 Call Trace: dump_stack+0x78/0xa0 check_noncircular+0x16f/0x190 check_prev_add+0x98/0xa20 validate_chain+0xa8c/0x2a00 __lock_acquire+0x56f/0xaa0 lock_acquire+0xa3/0x440 ? __btrfs_release_delayed_node.part.0+0x3f/0x310 [btrfs] __mutex_lock+0xa0/0xaf0 ? __btrfs_release_delayed_node.part.0+0x3f/0x310 [btrfs] ? __lock_acquire+0x56f/0xaa0 ? __btrfs_release_delayed_node.part.0+0x3f/0x310 [btrfs] ? lock_acquire+0xa3/0x440 ? btrfs_evict_inode+0x138/0x560 [btrfs] ? btrfs_evict_inode+0x2fe/0x560 [btrfs] ? __btrfs_release_delayed_node.part.0+0x3f/0x310 [btrfs] __btrfs_release_delayed_node.part.0+0x3f/0x310 [btrfs] btrfs_evict_inode+0x3bf/0x560 [btrfs] evict+0xd6/0x1c0 dispose_list+0x48/0x70 prune_icache_sb+0x54/0x80 super_cache_scan+0x121/0x1a0 do_shrink_slab+0x175/0x420 shrink_slab+0xb1/0x2e0 shrink_node+0x192/0x600 balance_pgdat+0x31f/0x750 kswapd+0x206/0x510 ? _raw_spin_unlock_irqrestore+0x3e/0x50 ? finish_wait+0x90/0x90 ? balance_pgdat+0x750/0x750 kthread+0x137/0x150 ? kthread_stop+0x2a0/0x2a0 ret_from_fork+0x1f/0x30 This is because we're holding the chunk_mutex while adding this device and adding its sysfs entries. We actually hold different locks in different places when calling this function, the dev_replace semaphore for instance in dev replace, so instead of moving this call around simply wrap it's operations in NOFS. CC: stable@vger.kernel.org # 4.14+ Reported-by: David Sterba <dsterba@suse.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
|
|
|
fbabd4a36f |
btrfs: return EROFS for BTRFS_FS_STATE_ERROR cases
Eric reported seeing this message while running generic/475 BTRFS: error (device dm-3) in btrfs_sync_log:3084: errno=-117 Filesystem corrupted Full stack trace: BTRFS: error (device dm-0) in btrfs_commit_transaction:2323: errno=-5 IO failure (Error while writing out transaction) BTRFS info (device dm-0): forced readonly BTRFS warning (device dm-0): Skipping commit of aborted transaction. ------------[ cut here ]------------ BTRFS: error (device dm-0) in cleanup_transaction:1894: errno=-5 IO failure BTRFS: Transaction aborted (error -117) BTRFS warning (device dm-0): direct IO failed ino 3555 rw 0,0 sector 0x1c6480 len 4096 err no 10 BTRFS warning (device dm-0): direct IO failed ino 3555 rw 0,0 sector 0x1c6488 len 4096 err no 10 BTRFS warning (device dm-0): direct IO failed ino 3555 rw 0,0 sector 0x1c6490 len 4096 err no 10 BTRFS warning (device dm-0): direct IO failed ino 3555 rw 0,0 sector 0x1c6498 len 4096 err no 10 BTRFS warning (device dm-0): direct IO failed ino 3555 rw 0,0 sector 0x1c64a0 len 4096 err no 10 BTRFS warning (device dm-0): direct IO failed ino 3555 rw 0,0 sector 0x1c64a8 len 4096 err no 10 BTRFS warning (device dm-0): direct IO failed ino 3555 rw 0,0 sector 0x1c64b0 len 4096 err no 10 BTRFS warning (device dm-0): direct IO failed ino 3555 rw 0,0 sector 0x1c64b8 len 4096 err no 10 BTRFS warning (device dm-0): direct IO failed ino 3555 rw 0,0 sector 0x1c64c0 len 4096 err no 10 BTRFS warning (device dm-0): direct IO failed ino 3572 rw 0,0 sector 0x1b85e8 len 4096 err no 10 BTRFS warning (device dm-0): direct IO failed ino 3572 rw 0,0 sector 0x1b85f0 len 4096 err no 10 WARNING: CPU: 3 PID: 23985 at fs/btrfs/tree-log.c:3084 btrfs_sync_log+0xbc8/0xd60 [btrfs] BTRFS warning (device dm-0): direct IO failed ino 3548 rw 0,0 sector 0x1d4288 len 4096 err no 10 BTRFS warning (device dm-0): direct IO failed ino 3548 rw 0,0 sector 0x1d4290 len 4096 err no 10 BTRFS warning (device dm-0): direct IO failed ino 3548 rw 0,0 sector 0x1d4298 len 4096 err no 10 BTRFS warning (device dm-0): direct IO failed ino 3548 rw 0,0 sector 0x1d42a0 len 4096 err no 10 BTRFS warning (device dm-0): direct IO failed ino 3548 rw 0,0 sector 0x1d42a8 len 4096 err no 10 BTRFS warning (device dm-0): direct IO failed ino 3548 rw 0,0 sector 0x1d42b0 len 4096 err no 10 BTRFS warning (device dm-0): direct IO failed ino 3548 rw 0,0 sector 0x1d42b8 len 4096 err no 10 BTRFS warning (device dm-0): direct IO failed ino 3548 rw 0,0 sector 0x1d42c0 len 4096 err no 10 BTRFS warning (device dm-0): direct IO failed ino 3548 rw 0,0 sector 0x1d42c8 len 4096 err no 10 BTRFS warning (device dm-0): direct IO failed ino 3548 rw 0,0 sector 0x1d42d0 len 4096 err no 10 CPU: 3 PID: 23985 Comm: fsstress Tainted: G W L 5.8.0-rc4-default+ #1181 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.12.0-59-gc9ba527-rebuilt.opensuse.org 04/01/2014 RIP: 0010:btrfs_sync_log+0xbc8/0xd60 [btrfs] RSP: 0018:ffff909a44d17bd0 EFLAGS: 00010286 RAX: 0000000000000000 RBX: 0000000000000001 RCX: 0000000000000001 RDX: ffff8f3be41cb940 RSI: ffffffffb0108d2b RDI: ffffffffb0108ff7 RBP: ffff909a44d17e70 R08: 0000000000000000 R09: 0000000000000000 R10: 0000000000000000 R11: 0000000000037988 R12: ffff8f3bd20e4000 R13: ffff8f3bd20e4428 R14: 00000000ffffff8b R15: ffff909a44d17c70 FS: 00007f6a6ed3fb80(0000) GS:ffff8f3c3dc00000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 00007f6a6ed3e000 CR3: 00000000525c0003 CR4: 0000000000160ee0 Call Trace: ? finish_wait+0x90/0x90 ? __mutex_unlock_slowpath+0x45/0x2a0 ? lock_acquire+0xa3/0x440 ? lockref_put_or_lock+0x9/0x30 ? dput+0x20/0x4a0 ? dput+0x20/0x4a0 ? do_raw_spin_unlock+0x4b/0xc0 ? _raw_spin_unlock+0x1f/0x30 btrfs_sync_file+0x335/0x490 [btrfs] do_fsync+0x38/0x70 __x64_sys_fsync+0x10/0x20 do_syscall_64+0x50/0xe0 entry_SYSCALL_64_after_hwframe+0x44/0xa9 RIP: 0033:0x7f6a6ef1b6e3 Code: Bad RIP value. RSP: 002b:00007ffd01e20038 EFLAGS: 00000246 ORIG_RAX: 000000000000004a RAX: ffffffffffffffda RBX: 000000000007a120 RCX: 00007f6a6ef1b6e3 RDX: 00007ffd01e1ffa0 RSI: 00007ffd01e1ffa0 RDI: 0000000000000003 RBP: 0000000000000003 R08: 0000000000000001 R09: 00007ffd01e2004c R10: 0000000000000000 R11: 0000000000000246 R12: 000000000000009f R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000 irq event stamp: 0 hardirqs last enabled at (0): [<0000000000000000>] 0x0 hardirqs last disabled at (0): [<ffffffffb007fe0b>] copy_process+0x67b/0x1b00 softirqs last enabled at (0): [<ffffffffb007fe0b>] copy_process+0x67b/0x1b00 softirqs last disabled at (0): [<0000000000000000>] 0x0 ---[ end trace af146e0e38433456 ]--- BTRFS: error (device dm-0) in btrfs_sync_log:3084: errno=-117 Filesystem corrupted This ret came from btrfs_write_marked_extents(). If we get an aborted transaction via EIO before, we'll see it in btree_write_cache_pages() and return EUCLEAN, which gets printed as "Filesystem corrupted". Except we shouldn't be returning EUCLEAN here, we need to be returning EROFS because EUCLEAN is reserved for actual corruption, not IO errors. We are inconsistent about our handling of BTRFS_FS_STATE_ERROR elsewhere, but we want to use EROFS for this particular case. The original transaction abort has the real error code for why we ended up with an aborted transaction, all subsequent actions just need to return EROFS because they may not have a trans handle and have no idea about the original cause of the abort. After patch "btrfs: don't WARN if we abort a transaction with EROFS" the stacktrace will not be dumped either. Reported-by: Eric Sandeen <esandeen@redhat.com> CC: stable@vger.kernel.org # 5.4+ Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> [ add full test stacktrace ] Signed-off-by: David Sterba <dsterba@suse.com> |
|
|
|
5913139343 |
btrfs: document special case error codes for fs errors
We've had some discussions about what to do in certain scenarios for error codes, specifically EUCLEAN and EROFS. Document these near the error handling code so its clear what their intentions are. Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
|
|
|
f95ebdbed4 |
btrfs: don't WARN if we abort a transaction with EROFS
If we got some sort of corruption via a read and call btrfs_handle_fs_error() we'll set BTRFS_FS_STATE_ERROR on the fs and complain. If a subsequent trans handle trips over this it'll get EROFS and then abort. However at that point we're not aborting for the original reason, we're aborting because we've been flipped read only. We do not need to WARN_ON() here. CC: stable@vger.kernel.org # 5.4+ Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
|
|
|
3ebac17ce5 |
btrfs: reduce contention on log trees when logging checksums
The possibility of extents being shared (through clone and deduplication operations) requires special care when logging data checksums, to avoid having a log tree with different checksum items that cover ranges which overlap (which resulted in missing checksums after replaying a log tree). Such problems were fixed in the past by the following commits: commit |
|
|
|
b69d1ee923 |
btrfs: remove done label in writepage_delalloc
Since there is not common cleanup run after the label it makes it somewhat redundant. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
|
|
|
fd7fb634d6 |
btrfs: add comments for btrfs_reserve_flush_enum
This enum is the interface exposed to developers. Although we have a detailed comment explaining the whole idea of space flushing at the beginning of space-info.c, the exposed enum interface doesn't have any comment. Some corner cases, like BTRFS_RESERVE_FLUSH_ALL and BTRFS_RESERVE_FLUSH_ALL_STEAL can be interrupted by fatal signals, are not explained at all. So add some simple comments for these enums as a quick reference. Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
|
|
|
44d354abf3 |
btrfs: relocation: review the call sites which can be interrupted by signal
Since most metadata reservation calls can return -EINTR when get interrupted by fatal signal, we need to review the all the metadata reservation call sites. In relocation code, the metadata reservation happens in the following sites: - btrfs_block_rsv_refill() in merge_reloc_root() merge_reloc_root() is a pretty critical section, we don't want to be interrupted by signal, so change the flush status to BTRFS_RESERVE_FLUSH_LIMIT, so it won't get interrupted by signal. Since such change can be ENPSPC-prone, also shrink the amount of metadata to reserve least amount avoid deadly ENOSPC there. - btrfs_block_rsv_refill() in reserve_metadata_space() It calls with BTRFS_RESERVE_FLUSH_LIMIT, which won't get interrupted by signal. - btrfs_block_rsv_refill() in prepare_to_relocate() - btrfs_block_rsv_add() in prepare_to_relocate() - btrfs_block_rsv_refill() in relocate_block_group() - btrfs_delalloc_reserve_metadata() in relocate_file_extent_cluster() - btrfs_start_transaction() in relocate_block_group() - btrfs_start_transaction() in create_reloc_inode() Can be interrupted by fatal signal and we can handle it easily. For these call sites, just catch the -EINTR value in btrfs_balance() and count them as canceled. CC: stable@vger.kernel.org # 5.4+ Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
|
|
|
f3e3d9cc35 |
btrfs: avoid possible signal interruption of btrfs_drop_snapshot() on relocation tree
[BUG]
There is a bug report about bad signal timing could lead to read-only
fs during balance:
BTRFS info (device xvdb): balance: start -d -m -s
BTRFS info (device xvdb): relocating block group 73001861120 flags metadata
BTRFS info (device xvdb): found 12236 extents, stage: move data extents
BTRFS info (device xvdb): relocating block group 71928119296 flags data
BTRFS info (device xvdb): found 3 extents, stage: move data extents
BTRFS info (device xvdb): found 3 extents, stage: update data pointers
BTRFS info (device xvdb): relocating block group 60922265600 flags metadata
BTRFS: error (device xvdb) in btrfs_drop_snapshot:5505: errno=-4 unknown
BTRFS info (device xvdb): forced readonly
BTRFS info (device xvdb): balance: ended with status: -4
[CAUSE]
The direct cause is the -EINTR from the following call chain when a
fatal signal is pending:
relocate_block_group()
|- clean_dirty_subvols()
|- btrfs_drop_snapshot()
|- btrfs_start_transaction()
|- btrfs_delayed_refs_rsv_refill()
|- btrfs_reserve_metadata_bytes()
|- __reserve_metadata_bytes()
|- wait_reserve_ticket()
|- prepare_to_wait_event();
|- ticket->error = -EINTR;
Normally this behavior is fine for most btrfs_start_transaction()
callers, as they need to catch any other error, same for the signal, and
exit ASAP.
However for balance, especially for the clean_dirty_subvols() case, we're
already doing cleanup works, getting -EINTR from btrfs_drop_snapshot()
could cause a lot of unexpected problems.
From the mentioned forced read-only report, to later balance error due
to half dropped reloc trees.
[FIX]
Fix this problem by using btrfs_join_transaction() if
btrfs_drop_snapshot() is called from relocation context.
Since btrfs_join_transaction() won't get interrupted by signal, we can
continue the cleanup.
CC: stable@vger.kernel.org # 5.4+
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>3
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
|
|
5cb502f4ab |
btrfs: relocation: allow signal to cancel balance
Although btrfs balance can be canceled with "btrfs balance cancel" command, it's still almost muscle memory to press Ctrl-C to cancel a long running btrfs balance. So allow btrfs balance to check signal to determine if it should exit. The cancellation points are in known location and we're only adding one more reason, so this should be safe. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
|
|
|
813f8a0e26 |
btrfs: raid56: remove out label in __raid56_parity_recover
There's no cleanup that occurs so we can simply return 0 directly. Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
|
|
|
f37c563bab |
btrfs: add missing check for nocow and compression inode flags
User Forza reported on IRC that some invalid combinations of file attributes are accepted by chattr. The NODATACOW and compression file flags/attributes are mutually exclusive, but they could be set by 'chattr +c +C' on an empty file. The nodatacow will be in effect because it's checked first in btrfs_run_delalloc_range. Extend the flag validation to catch the following cases: - input flags are conflicting - old and new flags are conflicting - initialize the local variable with inode flags after inode ls locked Inode attributes take precedence over mount options and are an independent setting. Nocompress would be a no-op with nodatacow, but we don't want to mix any compression-related options with nodatacow. CC: stable@vger.kernel.org # 4.4+ Signed-off-by: David Sterba <dsterba@suse.com> |
|
|
|
4faf55b038 |
btrfs: don't traverse into the seed devices in show_devname
->show_devname currently shows the lowest devid in the list. As the seed
devices have the lowest devid in the sprouted filesystem, the userland
tool such as findmnt end up seeing seed device instead of the device from
the read-writable sprouted filesystem. As shown below.
mount /dev/sda /btrfs
mount: /btrfs: WARNING: device write-protected, mounted read-only.
findmnt --output SOURCE,TARGET,UUID /btrfs
SOURCE TARGET UUID
/dev/sda /btrfs 899f7027-3e46-4626-93e7-7d4c9ad19111
btrfs dev add -f /dev/sdb /btrfs
umount /btrfs
mount /dev/sdb /btrfs
findmnt --output SOURCE,TARGET,UUID /btrfs
SOURCE TARGET UUID
/dev/sda /btrfs 899f7027-3e46-4626-93e7-7d4c9ad19111
All sprouts from a single seed will show the same seed device and the
same fsid. That's confusing.
This is causing problems in our prototype as there isn't any reference
to the sprout file-system(s) which is being used for actual read and
write.
This was added in the patch which implemented the show_devname in btrfs
commit
|
|
|
|
a3cf0e4342 |
btrfs: qgroup: free per-trans reserved space when a subvolume gets dropped
[BUG] Sometime fsstress could lead to qgroup warning for case like generic/013: BTRFS warning (device dm-3): qgroup 0/259 has unreleased space, type 1 rsv 81920 ------------[ cut here ]------------ WARNING: CPU: 9 PID: 24535 at fs/btrfs/disk-io.c:4142 close_ctree+0x1dc/0x323 [btrfs] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 0.0.0 02/06/2015 RIP: 0010:close_ctree+0x1dc/0x323 [btrfs] Call Trace: btrfs_put_super+0x15/0x17 [btrfs] generic_shutdown_super+0x72/0x110 kill_anon_super+0x18/0x30 btrfs_kill_super+0x17/0x30 [btrfs] deactivate_locked_super+0x3b/0xa0 deactivate_super+0x40/0x50 cleanup_mnt+0x135/0x190 __cleanup_mnt+0x12/0x20 task_work_run+0x64/0xb0 __prepare_exit_to_usermode+0x1bc/0x1c0 __syscall_return_slowpath+0x47/0x230 do_syscall_64+0x64/0xb0 entry_SYSCALL_64_after_hwframe+0x44/0xa9 ---[ end trace 6c341cdf9b6cc3c1 ]--- BTRFS error (device dm-3): qgroup reserved space leaked While that subvolume 259 is no longer in that filesystem. [CAUSE] Normally per-trans qgroup reserved space is freed when a transaction is committed, in commit_fs_roots(). However for completely dropped subvolume, that subvolume is completely gone, thus is no longer in the fs_roots_radix, and its per-trans reserved qgroup will never be freed. Since the subvolume is already gone, leaked per-trans space won't cause any trouble for end users. [FIX] Just call btrfs_qgroup_free_meta_all_pertrans() before a subvolume is completely dropped. Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
|
|
|
d60ba8de11 |
btrfs: ref-verify: fix memory leak in add_block_entry
clang static analysis flags this error
fs/btrfs/ref-verify.c:290:3: warning: Potential leak of memory pointed to by 're' [unix.Malloc]
kfree(be);
^~~~~
The problem is in this block of code:
if (root_objectid) {
struct root_entry *exist_re;
exist_re = insert_root_entry(&exist->roots, re);
if (exist_re)
kfree(re);
}
There is no 'else' block freeing when root_objectid is 0. Add the
missing kfree to the else branch.
Fixes:
|
|
|
|
d85327b1d8 |
btrfs: prefetch chunk tree leaves at mount
The whole chunk tree is read at mount time so we can utilize readahead to get the tree blocks to memory before we read the items. The idea is from Robbie, but instead of updating search slot readahead, this patch implements the chunk tree readahead manually from nodes on level 1. We've decided to do specific readahead optimizations and then unify them under a common API so we don't break everything by changing the search slot readahead logic. Higher chunk trees grow on large filesystems (many terabytes), and prefetching just level 1 seems to be sufficient. Provided example was from a 200TiB filesystem with chunk tree level 2. CC: Robbie Ko <robbieko@synology.com> Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: David Sterba <dsterba@suse.com> |
|
|
|
49bac89768 |
btrfs: add metadata_uuid to FS_INFO ioctl
Add retrieval of the filesystem's metadata UUID to the fsinfo ioctl. This is driven by setting the BTRFS_FS_INFO_FLAG_METADATA_UUID flag in btrfs_ioctl_fs_info_args::flags. Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
|
|
|
0fb408a558 |
btrfs: add filesystem generation to FS_INFO ioctl
Add retrieval of the filesystem's generation to the fsinfo ioctl. This is driven by setting the BTRFS_FS_INFO_FLAG_GENERATION flag in btrfs_ioctl_fs_info_args::flags. Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
|
|
|
137c541821 |
btrfs: pass checksum type via BTRFS_IOC_FS_INFO ioctl
With the recent addition of filesystem checksum types other than CRC32c,
it is not anymore hard-coded which checksum type a btrfs filesystem uses.
Up to now there is no good way to read the filesystem checksum, apart from
reading the filesystem UUID and then query sysfs for the checksum type.
Add a new csum_type and csum_size fields to the BTRFS_IOC_FS_INFO ioctl
command which usually is used to query filesystem features. Also add a
flags member indicating that the kernel responded with a set csum_type and
csum_size field.
For compatibility reasons, only return the csum_type and csum_size if
the BTRFS_FS_INFO_FLAG_CSUM_INFO flag was passed to the kernel. Also
clear any unknown flags so we don't pass false positives to user-space
newer than the kernel.
To simplify further additions to the ioctl, also switch the padding to a
u8 array. Pahole was used to verify the result of this switch:
The csum members are added before flags, which might look odd, but this
is to keep the alignment requirements and not to introduce holes in the
structure.
$ pahole -C btrfs_ioctl_fs_info_args fs/btrfs/btrfs.ko
struct btrfs_ioctl_fs_info_args {
__u64 max_id; /* 0 8 */
__u64 num_devices; /* 8 8 */
__u8 fsid[16]; /* 16 16 */
__u32 nodesize; /* 32 4 */
__u32 sectorsize; /* 36 4 */
__u32 clone_alignment; /* 40 4 */
__u16 csum_type; /* 44 2 */
__u16 csum_size; /* 46 2 */
__u64 flags; /* 48 8 */
__u8 reserved[968]; /* 56 968 */
/* size: 1024, cachelines: 16, members: 10 */
};
Fixes:
|
|
|
|
adca4d945c |
btrfs: qgroup: remove ASYNC_COMMIT mechanism in favor of reserve retry-after-EDQUOT
commit
|
|
|
|
c53e965360 |
btrfs: qgroup: try to flush qgroup space when we get -EDQUOT
[PROBLEM]
There are known problem related to how btrfs handles qgroup reserved
space. One of the most obvious case is the the test case btrfs/153,
which do fallocate, then write into the preallocated range.
btrfs/153 1s ... - output mismatch (see xfstests-dev/results//btrfs/153.out.bad)
--- tests/btrfs/153.out 2019-10-22 15:18:14.068965341 +0800
+++ xfstests-dev/results//btrfs/153.out.bad 2020-07-01 20:24:40.730000089 +0800
@@ -1,2 +1,5 @@
QA output created by 153
+pwrite: Disk quota exceeded
+/mnt/scratch/testfile2: Disk quota exceeded
+/mnt/scratch/testfile2: Disk quota exceeded
Silence is golden
...
(Run 'diff -u xfstests-dev/tests/btrfs/153.out xfstests-dev/results//btrfs/153.out.bad' to see the entire diff)
[CAUSE]
Since commit
|
|
|
|
263da812e8 |
btrfs: qgroup: allow to unreserve range without releasing other ranges
[PROBLEM]
Before this patch, when btrfs_qgroup_reserve_data() fails, we free all
reserved space of the changeset.
For example:
ret = btrfs_qgroup_reserve_data(inode, changeset, 0, SZ_1M);
ret = btrfs_qgroup_reserve_data(inode, changeset, SZ_1M, SZ_1M);
ret = btrfs_qgroup_reserve_data(inode, changeset, SZ_2M, SZ_1M);
If the last btrfs_qgroup_reserve_data() failed, it will release the
entire [0, 3M) range.
This behavior is kind of OK for now, as when we hit -EDQUOT, we normally
go error handling and need to release all reserved ranges anyway.
But this also means the following call is not possible:
ret = btrfs_qgroup_reserve_data();
if (ret == -EDQUOT) {
/* Do something to free some qgroup space */
ret = btrfs_qgroup_reserve_data();
}
As if the first btrfs_qgroup_reserve_data() fails, it will free all
reserved qgroup space.
[CAUSE]
This is because we release all reserved ranges when
btrfs_qgroup_reserve_data() fails.
[FIX]
This patch will implement a new function, qgroup_unreserve_range(), to
iterate through the ulist nodes, to find any nodes in the failure range,
and remove the EXTENT_QGROUP_RESERVED bits from the io_tree, and
decrease the extent_changeset::bytes_changed, so that we can revert to
previous state.
This allows later patches to retry btrfs_qgroup_reserve_data() if EDQUOT
happens.
Suggested-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
|
|
48aaeebe4e |
btrfs: convert block group refcount to refcount_t
We have refcount_t now with the associated library to handle refcounts, which gives us extra debugging around reference count mistakes that may be made. For example it'll warn on any transition from 0->1 or 0->-1, which is handy for noticing cases where we've messed up reference counting. Convert the block group ref counting from an atomic_t to refcount_t and use the appropriate helpers. Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
|
|
|
60f8667b61 |
btrfs: add multi-statement protection to btrfs_set/clear_and_info macros
Multi-statement macros should be enclosed in do/while(0) block to make their use safe in single statement if conditions. All current uses of the macros are safe, so this change is for future protection. Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: Marcos Paulo de Souza <mpdesouza@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
|
|
|
93c4c033ec |
btrfs: remove fail label in check_compressed_csum
Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
|
|
|
b7d2083a36 |
btrfs: raid56: don't opencode swap() in __raid_recover_end_io
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
|
|
|
8302586327 |
btrfs: raid56: use in_range where applicable
While at it use the opportunity to simplify find_logical_bio_stripe by reducing the scope of 'stripe_start' variable and squash the sector-to-bytes conversion on one line. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
|
|
|
bf28a605e6 |
btrfs: raid56: assign bio in while() when using bio_list_pop
Unify the style in the file such that return value of bio_list_pop is assigned directly in the while loop. This is in line with the rest of the kernel. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
|
|
|
f90ae76a5c |
btrfs: raid56: remove redundant device check in rbio_add_io_page
The merging logic is always executed if the current stripe's device is not missing. So there's no point in duplicating the check. Simply remove it, while at it reduce the scope of the 'last_end' variable. If the current stripe's device is missing we fail the stripe early on. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
|
|
|
608769a4e4 |
btrfs: always initialize btrfs_bio::tgtdev_map/raid_map pointers
Since btrfs_bio always contains the extra space for the tgtdev_map and raid_maps it's pointless to make the assignment iff specific conditions are met. Instead, always assign the pointers to their correct value at allocation time. To accommodate this change also move code a bit in __btrfs_map_block so that btrfs_bio::stripes array is always initialized before the raid_map, subsequently move the call to sort_parity_stripes in the 'if' building the raid_map, retaining the old behavior. To better understand the change, there are 2 aspects to this: 1. The original code is harder to grasp because the calculations for initializing raid_map/tgtdev ponters are apart from the initial allocation of memory. Having them predicated on 2 separate checks doesn't help that either... So by moving the initialisation in alloc_btrfs_bio puts everything together. 2. tgtdev/raid_maps are now always initialized despite sometimes they might be equal i.e __btrfs_map_block_for_discard calls alloc_btrfs_bio with tgtdev = 0 but their usage should be predicated on external checks i.e. just because those pointers are non-null doesn't mean they are valid per-se. And actually while taking another look at __btrfs_map_block I saw a discrepancy: Original code initialised tgtdev_map if the following check is true: if (dev_replace_is_ongoing && dev_replace->tgtdev != NULL) However, further down tgtdev_map is only used if the following check is true: if (dev_replace_is_ongoing && dev_replace->tgtdev != NULL && need_full_stripe(op)) e.g. the additional need_full_stripe(op) predicate is there. Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> [ copy more details from mail discussion ] Signed-off-by: David Sterba <dsterba@suse.com> |
|
|
|
3092c68fc5 |
btrfs: sysfs: add bdi link to the fsid directory
Since BTRFS uses a private bdi it makes sense to create a link to this bdi under /sys/fs/btrfs/<UUID>/bdi. This allows size of read ahead to be controlled. Without this patch it's not possible to uniquely identify which bdi pertains to which btrfs filesystem in the case of multiple btrfs filesystems. It's fine to simply call sysfs_remove_link without checking if the link indeed has been created. The call path sysfs_remove_link kernfs_remove_by_name kernfs_remove_by_name_ns will simply return -ENOENT in case it doesn't exist. Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
|
|
|
5a9472fe7f |
btrfs: increment corrupt device counter during compressed read
If a compressed read fails due to checksum error only a line is printed to dmesg, device corrupt counter is not modified. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
|
|
|
26056eab4b |
btrfs: remove needless ASSERT check of orig_bio in end_compressed_bio_read
compressed_bio::orig_bio is always set in btrfs_submit_compressed_read before any bio submission is performed. Since that function is always called with a valid bio it renders the ASSERT unnecessary. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
|
|
|
814723e0a5 |
btrfs: increment device corruption error in case of checksum error
Now that btrfs_io_bio have access to btrfs_device we can safely increment the device corruption counter on error. There is one notable exception - repair bios for raid. Since those don't go through the normal submit_stripe_bio callpath but through raid56_parity_recover thus repair bios won't have their device set. Scrub increments the corruption counter for checksum mismatch as well but does not call this function. Link: https://lore.kernel.org/linux-btrfs/4857863.FCrPRfMyHP@liv/ Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
|
|
|
3eee86c8fd |
btrfs: don't check for btrfs_device::bdev in btrfs_end_bio
btrfs_map_bio ensures that all submitted bios to devices have valid btrfs_device::bdev so this check can be removed from btrfs_end_bio. This check was added in june 2012 |
|
|
|
c31efbdf23 |
btrfs: record btrfs_device directly in btrfs_io_bio
Instead of recording stripe_index and using that to access correct btrfs_device from btrfs_bio::stripes record the btrfs_device in btrfs_io_bio. This will enable endio handlers to increment device error counters on checksum errors. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
|
|
|
3526302f26 |
btrfs: streamline btrfs_get_io_failure_record logic
Make the function directly return a pointer to a failure record and adjust callers to handle it. Also refactor the logic inside so that the case which allocates the failure record for the first time is not handled in an 'if' arm, saving us a level of indentation. Finally make the function static as it's not used outside of extent_io.c . Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
|
|
|
2279a27053 |
btrfs: make get_state_failrec return failrec directly
Only failure that get_state_failrec can get is if there is no failure for the given address. There is no reason why the function should return a status code and use a separate parameter for returning the actual failure rec (if one is found). Simplify it by making the return type a pointer and return ERR_PTR value in case of errors. Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
|
|
|
b90a4ab6ba |
btrfs: remove deprecated mount option subvolrootid
The option subvolrootid used to be a workaround for mounting subvolumes
and ineffective since
|
|
|
|
d801e7a355 |
btrfs: remove deprecated mount option alloc_start
The mount option alloc_start has no effect since
|
|
|
|
a93e01682e |
btrfs: remove no longer needed use of log_writers for the log root tree
When syncing the log, we used to update the log root tree without holding
neither the log_mutex of the subvolume root nor the log_mutex of log root
tree.
We used to have two critical sections delimited by the log_mutex of the
log root tree, so in the first one we incremented the log_writers of the
log root tree and on the second one we decremented it and waited for the
log_writers counter to go down to zero. This was because the update of
the log root tree happened between the two critical sections.
The use of two critical sections allowed a little bit more of parallelism
and required the use of the log_writers counter, necessary to make sure
we didn't miss any log root tree update when we have multiple tasks trying
to sync the log in parallel.
However after commit
|
|
|
|
28a9579561 |
btrfs: stop incremening log_batch for the log root tree when syncing log
We are incrementing the log_batch atomic counter of the root log tree but
we never use that counter, it's used only for the log trees of subvolume
roots. We started doing it when we moved the log_batch and log_write
counters from the global, per fs, btrfs_fs_info structure, into the
btrfs_root structure in commit
|
|
|
|
5aa7d1a7f4 |
btrfs: only commit delayed items at fsync if we are logging a directory
When logging an inode we are committing its delayed items if either the
inode is a directory or if it is a new inode, created in the current
transaction.
We need to do it for directories, since new directory indexes are stored
as delayed items of the inode and when logging a directory we need to be
able to access all indexes from the fs/subvolume tree in order to figure
out which index ranges need to be logged.
However for new inodes that are not directories, we do not need to do it
because the only type of delayed item they can have is the inode item, and
we are guaranteed to always log an up to date version of the inode item:
*) for a full fsync we do it by committing the delayed inode and then
copying the item from the fs/subvolume tree with
copy_inode_items_to_log();
*) for a fast fsync we always log the inode item based on the contents of
the in-memory struct btrfs_inode. We guarantee this is always done since
commit
|