- Support for userspace handling of synchronous external aborts (SEAs),
allowing the VMM to potentially handle the abort in a non-fatal
manner.
- Large rework of the VGIC's list register handling with the goal of
supporting more active/pending IRQs than available list registers in
hardware. In addition, the VGIC now supports EOImode==1 style
deactivations for IRQs which may occur on a separate vCPU than the
one that acked the IRQ.
- Support for FEAT_XNX (user / privileged execute permissions) and
FEAT_HAF (hardware update to the Access Flag) in the software page
table walkers and shadow MMU.
- Allow page table destruction to reschedule, fixing long need_resched
latencies observed when destroying a large VM.
- Minor fixes to KVM and selftests
Loongarch:
- Get VM PMU capability from HW GCFG register.
- Add AVEC basic support.
- Use 64-bit register definition for EIOINTC.
- Add KVM timer test cases for tools/selftests.
RISC/V:
- SBI message passing (MPXY) support for KVM guest
- Give a new, more specific error subcode for the case when in-kernel
AIA virtualization fails to allocate IMSIC VS-file
- Support KVM_DIRTY_LOG_INITIALLY_SET, enabling dirty log gradually
in small chunks
- Fix guest page fault within HLV* instructions
- Flush VS-stage TLB after VCPU migration for Andes cores
s390:
- Always allocate ESCA (Extended System Control Area), instead of
starting with the basic SCA and converting to ESCA with the
addition of the 65th vCPU. The price is increased number of
exits (and worse performance) on z10 and earlier processor;
ESCA was introduced by z114/z196 in 2010.
- VIRT_XFER_TO_GUEST_WORK support
- Operation exception forwarding support
- Cleanups
x86:
- Skip the costly "zap all SPTEs" on an MMIO generation wrap if MMIO SPTE
caching is disabled, as there can't be any relevant SPTEs to zap.
- Relocate a misplaced export.
- Fix an async #PF bug where KVM would clear the completion queue when the
guest transitioned in and out of paging mode, e.g. when handling an SMI and
then returning to paged mode via RSM.
- Leave KVM's user-return notifier registered even when disabling
virtualization, as long as kvm.ko is loaded. On reboot/shutdown, keeping
the notifier registered is ok; the kernel does not use the MSRs and the
callback will run cleanly and restore host MSRs if the CPU manages to
return to userspace before the system goes down.
- Use the checked version of {get,put}_user().
- Fix a long-lurking bug where KVM's lack of catch-up logic for periodic APIC
timers can result in a hard lockup in the host.
- Revert the periodic kvmclock sync logic now that KVM doesn't use a
clocksource that's subject to NTP corrections.
- Clean up KVM's handling of MMIO Stale Data and L1TF, and bury the latter
behind CONFIG_CPU_MITIGATIONS.
- Context switch XCR0, XSS, and PKRU outside of the entry/exit fast path;
the only reason they were handled in the fast path was to paper of a bug
in the core #MC code, and that has long since been fixed.
- Add emulator support for AVX MOV instructions, to play nice with emulated
devices whose guest drivers like to access PCI BARs with large multi-byte
instructions.
x86 (AMD):
- Fix a few missing "VMCB dirty" bugs.
- Fix the worst of KVM's lack of EFER.LMSLE emulation.
- Add AVIC support for addressing 4k vCPUs in x2AVIC mode.
- Fix incorrect handling of selective CR0 writes when checking intercepts
during emulation of L2 instructions.
- Fix a currently-benign bug where KVM would clobber SPEC_CTRL[63:32] on
VMRUN and #VMEXIT.
- Fix a bug where KVM corrupt the guest code stream when re-injecting a soft
interrupt if the guest patched the underlying code after the VM-Exit, e.g.
when Linux patches code with a temporary INT3.
- Add KVM_X86_SNP_POLICY_BITS to advertise supported SNP policy bits to
userspace, and extend KVM "support" to all policy bits that don't require
any actual support from KVM.
x86 (Intel):
- Use the root role from kvm_mmu_page to construct EPTPs instead of the
current vCPU state, partly as worthwhile cleanup, but mostly to pave the
way for tracking per-root TLB flushes, and elide EPT flushes on pCPU
migration if the root is clean from a previous flush.
- Add a few missing nested consistency checks.
- Rip out support for doing "early" consistency checks via hardware as the
functionality hasn't been used in years and is no longer useful in general;
replace it with an off-by-default module param to WARN if hardware fails
a check that KVM does not perform.
- Fix a currently-benign bug where KVM would drop the guest's SPEC_CTRL[63:32]
on VM-Enter.
- Misc cleanups.
- Overhaul the TDX code to address systemic races where KVM (acting on behalf
of userspace) could inadvertantly trigger lock contention in the TDX-Module;
KVM was either working around these in weird, ugly ways, or was simply
oblivious to them (though even Yan's devilish selftests could only break
individual VMs, not the host kernel)
- Fix a bug where KVM could corrupt a vCPU's cpu_list when freeing a TDX vCPU,
if creating said vCPU failed partway through.
- Fix a few sparse warnings (bad annotation, 0 != NULL).
- Use struct_size() to simplify copying TDX capabilities to userspace.
- Fix a bug where TDX would effectively corrupt user-return MSR values if the
TDX Module rejects VP.ENTER and thus doesn't clobber host MSRs as expected.
Selftests:
- Fix a math goof in mmu_stress_test when running on a single-CPU system/VM.
- Forcefully override ARCH from x86_64 to x86 to play nice with specifying
ARCH=x86_64 on the command line.
- Extend a bunch of nested VMX to validate nested SVM as well.
- Add support for LA57 in the core VM_MODE_xxx macro, and add a test to
verify KVM can save/restore nested VMX state when L1 is using 5-level
paging, but L2 is not.
- Clean up the guest paging code in anticipation of sharing the core logic for
nested EPT and nested NPT.
guest_memfd:
- Add NUMA mempolicy support for guest_memfd, and clean up a variety of
rough edges in guest_memfd along the way.
- Define a CLASS to automatically handle get+put when grabbing a guest_memfd
from a memslot to make it harder to leak references.
- Enhance KVM selftests to make it easer to develop and debug selftests like
those added for guest_memfd NUMA support, e.g. where test and/or KVM bugs
often result in hard-to-debug SIGBUS errors.
- Misc cleanups.
Generic:
- Use the recently-added WQ_PERCPU when creating the per-CPU workqueue for
irqfd cleanup.
- Fix a goof in the dirty ring documentation.
- Fix choice of target for directed yield across different calls to
kvm_vcpu_on_spin(); the function was always starting from the first
vCPU instead of continuing the round-robin search.
-----BEGIN PGP SIGNATURE-----
iQFIBAABCgAyFiEE8TM4V0tmI4mGbHaCv/vSX3jHroMFAmkvMa8UHHBib256aW5p
QHJlZGhhdC5jb20ACgkQv/vSX3jHroMlFwf+Ow7zOYUuELSQ+Jn+hOYXiCNrdBDx
ZamvMU8kLPr7XX0Zog6HgcMm//qyA6k5nSfqCjfsQZrIhRA/gWJ61jz1OX/Jxq18
pJ9Vz6epnEPYiOtBwz+v8OS8MqDqVNzj2i6W1/cLPQE50c1Hhw64HWS5CSxDQiHW
A7PVfl5YU12lW1vG3uE0sNESDt4Eh/spNM17iddXdF4ZUOGublserjDGjbc17E7H
8BX3DkC2plqkJKwtjg0ae62hREkITZZc7RqsnftUkEhn0N0H9+rb6NKUyzIVh9NZ
bCtCjtrKN9zfZ0Mujnms3ugBOVqNIputu/DtPnnFKXtXWSrHrgGSNv5ewA==
=PEcw
-----END PGP SIGNATURE-----
Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm
Pull KVM updates from Paolo Bonzini:
"ARM:
- Support for userspace handling of synchronous external aborts
(SEAs), allowing the VMM to potentially handle the abort in a
non-fatal manner
- Large rework of the VGIC's list register handling with the goal of
supporting more active/pending IRQs than available list registers
in hardware. In addition, the VGIC now supports EOImode==1 style
deactivations for IRQs which may occur on a separate vCPU than the
one that acked the IRQ
- Support for FEAT_XNX (user / privileged execute permissions) and
FEAT_HAF (hardware update to the Access Flag) in the software page
table walkers and shadow MMU
- Allow page table destruction to reschedule, fixing long
need_resched latencies observed when destroying a large VM
- Minor fixes to KVM and selftests
Loongarch:
- Get VM PMU capability from HW GCFG register
- Add AVEC basic support
- Use 64-bit register definition for EIOINTC
- Add KVM timer test cases for tools/selftests
RISC/V:
- SBI message passing (MPXY) support for KVM guest
- Give a new, more specific error subcode for the case when in-kernel
AIA virtualization fails to allocate IMSIC VS-file
- Support KVM_DIRTY_LOG_INITIALLY_SET, enabling dirty log gradually
in small chunks
- Fix guest page fault within HLV* instructions
- Flush VS-stage TLB after VCPU migration for Andes cores
s390:
- Always allocate ESCA (Extended System Control Area), instead of
starting with the basic SCA and converting to ESCA with the
addition of the 65th vCPU. The price is increased number of exits
(and worse performance) on z10 and earlier processor; ESCA was
introduced by z114/z196 in 2010
- VIRT_XFER_TO_GUEST_WORK support
- Operation exception forwarding support
- Cleanups
x86:
- Skip the costly "zap all SPTEs" on an MMIO generation wrap if MMIO
SPTE caching is disabled, as there can't be any relevant SPTEs to
zap
- Relocate a misplaced export
- Fix an async #PF bug where KVM would clear the completion queue
when the guest transitioned in and out of paging mode, e.g. when
handling an SMI and then returning to paged mode via RSM
- Leave KVM's user-return notifier registered even when disabling
virtualization, as long as kvm.ko is loaded. On reboot/shutdown,
keeping the notifier registered is ok; the kernel does not use the
MSRs and the callback will run cleanly and restore host MSRs if the
CPU manages to return to userspace before the system goes down
- Use the checked version of {get,put}_user()
- Fix a long-lurking bug where KVM's lack of catch-up logic for
periodic APIC timers can result in a hard lockup in the host
- Revert the periodic kvmclock sync logic now that KVM doesn't use a
clocksource that's subject to NTP corrections
- Clean up KVM's handling of MMIO Stale Data and L1TF, and bury the
latter behind CONFIG_CPU_MITIGATIONS
- Context switch XCR0, XSS, and PKRU outside of the entry/exit fast
path; the only reason they were handled in the fast path was to
paper of a bug in the core #MC code, and that has long since been
fixed
- Add emulator support for AVX MOV instructions, to play nice with
emulated devices whose guest drivers like to access PCI BARs with
large multi-byte instructions
x86 (AMD):
- Fix a few missing "VMCB dirty" bugs
- Fix the worst of KVM's lack of EFER.LMSLE emulation
- Add AVIC support for addressing 4k vCPUs in x2AVIC mode
- Fix incorrect handling of selective CR0 writes when checking
intercepts during emulation of L2 instructions
- Fix a currently-benign bug where KVM would clobber SPEC_CTRL[63:32]
on VMRUN and #VMEXIT
- Fix a bug where KVM corrupt the guest code stream when re-injecting
a soft interrupt if the guest patched the underlying code after the
VM-Exit, e.g. when Linux patches code with a temporary INT3
- Add KVM_X86_SNP_POLICY_BITS to advertise supported SNP policy bits
to userspace, and extend KVM "support" to all policy bits that
don't require any actual support from KVM
x86 (Intel):
- Use the root role from kvm_mmu_page to construct EPTPs instead of
the current vCPU state, partly as worthwhile cleanup, but mostly to
pave the way for tracking per-root TLB flushes, and elide EPT
flushes on pCPU migration if the root is clean from a previous
flush
- Add a few missing nested consistency checks
- Rip out support for doing "early" consistency checks via hardware
as the functionality hasn't been used in years and is no longer
useful in general; replace it with an off-by-default module param
to WARN if hardware fails a check that KVM does not perform
- Fix a currently-benign bug where KVM would drop the guest's
SPEC_CTRL[63:32] on VM-Enter
- Misc cleanups
- Overhaul the TDX code to address systemic races where KVM (acting
on behalf of userspace) could inadvertantly trigger lock contention
in the TDX-Module; KVM was either working around these in weird,
ugly ways, or was simply oblivious to them (though even Yan's
devilish selftests could only break individual VMs, not the host
kernel)
- Fix a bug where KVM could corrupt a vCPU's cpu_list when freeing a
TDX vCPU, if creating said vCPU failed partway through
- Fix a few sparse warnings (bad annotation, 0 != NULL)
- Use struct_size() to simplify copying TDX capabilities to userspace
- Fix a bug where TDX would effectively corrupt user-return MSR
values if the TDX Module rejects VP.ENTER and thus doesn't clobber
host MSRs as expected
Selftests:
- Fix a math goof in mmu_stress_test when running on a single-CPU
system/VM
- Forcefully override ARCH from x86_64 to x86 to play nice with
specifying ARCH=x86_64 on the command line
- Extend a bunch of nested VMX to validate nested SVM as well
- Add support for LA57 in the core VM_MODE_xxx macro, and add a test
to verify KVM can save/restore nested VMX state when L1 is using
5-level paging, but L2 is not
- Clean up the guest paging code in anticipation of sharing the core
logic for nested EPT and nested NPT
guest_memfd:
- Add NUMA mempolicy support for guest_memfd, and clean up a variety
of rough edges in guest_memfd along the way
- Define a CLASS to automatically handle get+put when grabbing a
guest_memfd from a memslot to make it harder to leak references
- Enhance KVM selftests to make it easer to develop and debug
selftests like those added for guest_memfd NUMA support, e.g. where
test and/or KVM bugs often result in hard-to-debug SIGBUS errors
- Misc cleanups
Generic:
- Use the recently-added WQ_PERCPU when creating the per-CPU
workqueue for irqfd cleanup
- Fix a goof in the dirty ring documentation
- Fix choice of target for directed yield across different calls to
kvm_vcpu_on_spin(); the function was always starting from the first
vCPU instead of continuing the round-robin search"
* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (260 commits)
KVM: arm64: at: Update AF on software walk only if VM has FEAT_HAFDBS
KVM: arm64: at: Use correct HA bit in TCR_EL2 when regime is EL2
KVM: arm64: Document KVM_PGTABLE_PROT_{UX,PX}
KVM: arm64: Fix spelling mistake "Unexpeced" -> "Unexpected"
KVM: arm64: Add break to default case in kvm_pgtable_stage2_pte_prot()
KVM: arm64: Add endian casting to kvm_swap_s[12]_desc()
KVM: arm64: Fix compilation when CONFIG_ARM64_USE_LSE_ATOMICS=n
KVM: arm64: selftests: Add test for AT emulation
KVM: arm64: nv: Expose hardware access flag management to NV guests
KVM: arm64: nv: Implement HW access flag management in stage-2 SW PTW
KVM: arm64: Implement HW access flag management in stage-1 SW PTW
KVM: arm64: Propagate PTW errors up to AT emulation
KVM: arm64: Add helper for swapping guest descriptor
KVM: arm64: nv: Use pgtable definitions in stage-2 walk
KVM: arm64: Handle endianness in read helper for emulated PTW
KVM: arm64: nv: Stop passing vCPU through void ptr in S2 PTW
KVM: arm64: Call helper for reading descriptors directly
KVM: arm64: nv: Advertise support for FEAT_XNX
KVM: arm64: Teach ptdump about FEAT_XNX permissions
KVM: s390: Use generic VIRT_XFER_TO_GUEST_WORK functions
...
Commit 2c25716dcc ("btrfs: zlib: fix and simplify the inline extent
decompression") renamed the 'start_byte' parameter to 'dest_pgoff' in
the btrfs_decompress(). The remaining 'start_byte' references are
inconsistent with the actual implementation and may cause confusion for
developers.
Ensure consistency between function declaration and implementation.
Signed-off-by: Zhen Ni <zhen.ni@easystack.cn>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Since the read verification and read repair are all supporting bs > ps
without large folios now, we can enable encoded read/write/send.
Now we can relax the alignment in assert_bbio_alignment() to
min(blocksize, PAGE_SIZE).
But also add the extra blocksize based alignment check for the logical
and length of the bbio.
There is a pitfall in btrfs_add_compress_bio_folios(), which relies on
the folios passed in to meet the minimal folio order.
But now we can pass regular page sized folios in, update it to check
each folio's size instead of using the minimal folio size.
This allows btrfs_add_compress_bio_folios() to even handle folios array
with different sizes, thankfully we don't yet need to handle such crazy
situation.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
In btrfs_compr_pool_scan(), use LIST_HEAD() to declare and initialize
the 'remove' list_head in one step instead of using INIT_LIST_HEAD()
separately.
Signed-off-by: Baolin Liu <liubaolin@kylinos.cn>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The reason why end_bbio_compressed_write() queues a work into
compressed_write_workers wq is for end_compressed_writeback() call, as
it will grab all the involved folios and clear the writeback flags,
which may sleep.
However now we always run btrfs_bio::end_io() in task context, there is
no need to queue the work anymore.
Just remove btrfs_fs_info::compressed_write_workers and
compressed_bio::write_end_work.
There is a comment about the works queued into
compressed_write_workers, now change to flush endio wq instead, which is
responsible to handle all data endio functions.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Currently there is only one caller which doesn't populate
btrfs_bio::inode, and that's scrub.
The idea is scrub doesn't want any automatic csum verification nor
read-repair, as everything will be handled by scrub itself.
However that behavior is really no different than metadata inode, thus
we can reuse btree_inode as btrfs_bio::inode for scrub.
The only exception is in btrfs_submit_chunk() where if a bbio is from
scrub or data reloc inode, we set rst_search_commit_root to true.
This means we still need a way to distinguish scrub from metadata, but
that can be done by a new flag inside btrfs_bio.
Now btrfs_bio::inode is a mandatory parameter, we can extract fs_info
from that inode thus can remove btrfs_bio::fs_info to save 8 bytes from
btrfs_bio structure.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Add a mempolicy parameter to filemap_alloc_folio() to enable NUMA-aware
page cache allocations. This will be used by upcoming changes to
support NUMA policies in guest-memfd, where guest_memory need to be
allocated NUMA policy specified by VMM.
All existing users pass NULL maintaining current behavior.
Reviewed-by: Pankaj Gupta <pankaj.gupta@amd.com>
Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: David Hildenbrand <david@redhat.com>
Signed-off-by: Shivank Garg <shivankg@amd.com>
Tested-by: Ashish Kalra <ashish.kalra@amd.com>
Link: https://lore.kernel.org/r/20250827175247.83322-4-shivankg@amd.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
This includes the following preparation for bs > ps cases:
- Always alloc/free the folio directly if bs > ps
This adds a new @fs_info parameter for btrfs_alloc_compr_folio(), thus
affecting all compression algorithms.
For btrfs_free_compr_folio() it needs no parameter for now, as we can
use the folio size to skip the caching part.
For now the change is just to passing a @fs_info into the function,
all the folio size assumption is still based on page size.
- Properly zero the last folio in compress_file_range()
Since the compressed folios can be larger than a page, we need to
properly zero the whole folio.
- Use correct folio size for btrfs_add_compressed_bio_folios()
Instead of page size, use the correct folio size.
- Use correct folio size/shift for btrfs_compress_filemap_get_folio()
As we are not only using simple page sized folios anymore.
- Use correct folio size for btrfs_decompress()
There is an ASSERT() making sure the decompressed range is no larger
than a page, which will be triggered for bs > ps cases.
- Skip readahead for compressed pages
Similar to subpage cases.
- Make btrfs_alloc_folio_array() to accept a new @order parameter
- Add a helper to calculate the minimal folio size
All those changes should not affect the existing bs <= ps handling.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Annual typo fixing pass. Strangely codespell found only about 30% of
what is in this patch, the rest was done manually using text
spellchecker with a custom dictionary of acceptable terms.
Reviewed-by: Neal Gompa <neal@gompa.dev>
Signed-off-by: David Sterba <dsterba@suse.com>
Since all workspace managers are per-fs, there is no need nor no way to
store them inside btrfs_compress_op::wsm anymore.
With that said, we can do the following modifications:
- Remove zstd_workspace_mananger::ops
Zstd always grab the global btrfs_compress_op[].
- Remove btrfs_compress_op::wsm member
- Rename btrfs_compress_op to btrfs_compress_levels
This should make it more clear that btrfs_compress_levels structures are
only to indicate the levels of each compress algorithm.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Since all workspaces are handled by the per-fs workspace managers, we
can safely remove the old per-module managers.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
There are several interfaces involved for each algorithm:
- alloc workspace
All algorithms allocate a workspace without the need for workspace
manager.
So no change needs to be done.
- get workspace
This involves checking the workspace manager to find a free one, and
if not, allocate a new one.
For none and lzo, they share the same generic btrfs_get_workspace()
helper, only needs to update that function to use the per-fs manager.
For zlib it uses a wrapper around btrfs_get_workspace(), so no special
work needed.
For zstd, update zstd_find_workspace() and zstd_get_workspace() to
utilize the per-fs manager.
- put workspace
For none/zlib/lzo they share the same btrfs_put_workspace(), update
that function to use the per-fs manager.
For zstd, it's zstd_put_workspace(), the same update.
- zstd specific timer
This is the timer to reclaim workspace, change it to grab the per-fs
workspace manager instead.
Now all workspace are managed by the per-fs manager.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
This involves:
- Add (alloc|free)_workspace_manager helpers.
These are the helper to alloc/free workspace_manager structure.
The allocator will allocate a workspace_manager structure, initialize
it, and pre-allocate one workspace for it.
The freer will do the cleanup and set the manager pointer to NULL.
- Call alloc_workspace_manager() inside btrfs_alloc_compress_wsm()
- Call alloc_workspace_manager() inside btrfs_free_compress_wsm()
For none, zlib and lzo compression algorithms.
For now the generic per-fs workspace managers won't really have any effect,
and all compression is still going through the global workspace manager.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
This involves:
- Add zstd_alloc_workspace_manager() and zstd_free_workspace_manager()
Those two functions will accept an fs_info pointer, and alloc/free
fs_info->compr_wsm[BTRFS_COMPRESS_ZSTD] pointer.
- Add btrfs_alloc_compress_wsm() and btrfs_free_compress_wsm()
Those are helpers allocating the workspace managers for all
algorithms.
For now only zstd is supported, and the timing is a little unusual,
the btrfs_alloc_compress_wsm() should only be called after the
sectorsize being initialized.
Meanwhile btrfs_free_fs_info_compress() is called in
btrfs_free_fs_info().
- Move the definition of btrfs_compression_type to "fs.h"
The reason is that "compression.h" has already included "fs.h", thus
we can not just include "compression.h" to get the definition of
BTRFS_NR_COMPRESS_TYPES to define fs_info::compr_wsm[].
For now the per-fs zstd workspace manager won't really have any effect,
and all compression is still going through the global workspace manager.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
[BACKGROUND]
Currently btrfs shares workspaces and their managers for all filesystems,
this is mostly fine as all those workspaces are using page size based
buffers, and btrfs only support block size (bs) <= page size (ps).
This means even if bs < ps, we at most waste some buffer space in the
workspace, but everything will still work fine.
The problem here is that is limiting our support for bs > ps cases.
As now a workspace now may need larger buffer to handle bs > ps cases,
but since the pool has no way to distinguish different workspaces, a
regular workspace (which is still using buffer size based on ps) can be
passed to a btrfs whose bs > ps.
In that case the buffer is not large enough, and will cause various
problems.
[ENHANCEMENT]
To prepare for the per-fs workspace migration, add an fs_info parameter
to all workspace related functions.
For now this new fs_info parameter is not yet utilized.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
For the 3 supported compression algorithms, two of them (zstd and zlib)
are already grabbing the btrfs inode for error messages.
It's more common to pass btrfs_inode and grab the address space from it.
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
If you run a workload with:
- a cgroup that does tons of parallel data reading, with a working set
much larger than its memory limit
- a second cgroup that writes relatively fewer files, with overwrites,
with no memory limit
(see full code listing at the bottom for a reproducer)
Then what quickly occurs is:
- we have a large number of threads trying to read the csum tree
- we have a decent number of threads deleting csums running delayed refs
- we have a large number of threads in direct reclaim and thus high
memory pressure
The result of this is that we writeback the csum tree repeatedly mid
transaction, to get back the extent_buffer folios for reclaim. As a
result, we repeatedly COW the csum tree for the delayed refs that are
deleting csums. This means repeatedly write locking the higher levels of
the tree.
As a result of this, we achieve an unpleasant priority inversion. We
have:
- a high degree of contention on the csum root node (and other upper
nodes) eb rwsem
- a memory starved cgroup doing tons of reclaim on CPU.
- many reader threads in the memory starved cgroup "holding" the sem
as readers, but not scheduling promptly. i.e., task __state == 0, but
not running on a cpu.
- btrfs_commit_transaction stuck trying to acquire the sem as a writer.
(running delayed_refs, deleting csums for unreferenced data extents)
This results in arbitrarily long transactions. This then results in
seriously degraded performance for any cgroup using the filesystem (the
victim cgroup in the script).
It isn't an academic problem, as we see this exact problem in production
at Meta with one cgroup over its memory limit ruining btrfs performance
for the whole system, stalling critical system services that depend on
btrfs syncs.
The underlying scheduling "problem" with global rwsems is sort of thorny
and apparently well known and was discussed at LPC 2024, for example.
As a result, our main lever in the short term is just trying to reduce
contention on our various rwsems with an eye to reducing the frequency
of write locking, to avoid disabling the read lock fast acquisition path.
Luckily, it seems likely that many reads are for old extents written
many transactions ago, and that for those we *can* in fact search the
commit root. The commit_root_sem only gets taken write once, near the
end of transaction commit, no matter how much memory pressure there is,
so we have much less contention between readers and writers.
This change detects when we are trying to read an old extent (according
to extent map generation) and then wires that through bio_ctrl to the
btrfs_bio, which unfortunately isn't allocated yet when we have this
information. When we go to lookup the csums in lookup_bio_sums we can
check this condition on the btrfs_bio and do the commit root lookup
accordingly.
Note that a single bio_ctrl might collect a few extent_maps into a single
bio, so it is important to track a maximum generation across all the
extent_maps used for each bio to make an accurate decision on whether it
is valid to look in the commit root. If any extent_map is updated in the
current generation, we can't use the commit root.
To test and reproduce this issue, I used the following script and
accompanying C program (to avoid bottlenecks in constantly forking
thousands of dd processes):
====== big-read.c ======
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>
#include <errno.h>
#define BUF_SZ (128 * (1 << 10UL))
int read_once(int fd, size_t sz) {
char buf[BUF_SZ];
size_t rd = 0;
int ret = 0;
while (rd < sz) {
ret = read(fd, buf, BUF_SZ);
if (ret < 0) {
if (errno == EINTR)
continue;
fprintf(stderr, "read failed: %d\n", errno);
return -errno;
} else if (ret == 0) {
break;
} else {
rd += ret;
}
}
return rd;
}
int read_loop(char *fname) {
int fd;
struct stat st;
size_t sz = 0;
int ret;
while (1) {
fd = open(fname, O_RDONLY);
if (fd == -1) {
perror("open");
return 1;
}
if (!sz) {
if (!fstat(fd, &st)) {
sz = st.st_size;
} else {
perror("stat");
return 1;
}
}
ret = read_once(fd, sz);
close(fd);
}
}
int main(int argc, char *argv[]) {
int fd;
struct stat st;
off_t sz;
char *buf;
int ret;
if (argc != 2) {
fprintf(stderr, "Usage: %s <filename>\n", argv[0]);
return 1;
}
return read_loop(argv[1]);
}
====== repro.sh ======
#!/usr/bin/env bash
SCRIPT=$(readlink -f "$0")
DIR=$(dirname "$SCRIPT")
dev=$1
mnt=$2
shift
shift
CG_ROOT=/sys/fs/cgroup
BAD_CG=$CG_ROOT/bad-nbr
GOOD_CG=$CG_ROOT/good-nbr
NR_BIGGOS=1
NR_LITTLE=10
NR_VICTIMS=32
NR_VILLAINS=512
START_SEC=$(date +%s)
_elapsed() {
echo "elapsed: $(($(date +%s) - $START_SEC))"
}
_stats() {
local sysfs=/sys/fs/btrfs/$(findmnt -no UUID $dev)
echo "================"
date
_elapsed
cat $sysfs/commit_stats
cat $BAD_CG/memory.pressure
}
_setup_cgs() {
echo "+memory +cpuset" > $CG_ROOT/cgroup.subtree_control
mkdir -p $GOOD_CG
mkdir -p $BAD_CG
echo max > $BAD_CG/memory.max
# memory.high much less than the working set will cause heavy reclaim
echo $((1 << 30)) > $BAD_CG/memory.high
# victims get a subset of villain CPUs
echo 0 > $GOOD_CG/cpuset.cpus
echo 0,1,2,3 > $BAD_CG/cpuset.cpus
}
_kill_cg() {
local cg=$1
local attempts=0
echo "kill cgroup $cg"
[ -f $cg/cgroup.procs ] || return
while true; do
attempts=$((attempts + 1))
echo 1 > $cg/cgroup.kill
sleep 1
procs=$(wc -l $cg/cgroup.procs | cut -d' ' -f1)
[ $procs -eq 0 ] && break
done
rmdir $cg
echo "killed cgroup $cg in $attempts attempts"
}
_biggo_vol() {
echo $mnt/biggo_vol.$1
}
_biggo_file() {
echo $(_biggo_vol $1)/biggo
}
_subvoled_biggos() {
total_sz=$((10 << 30))
per_sz=$((total_sz / $NR_VILLAINS))
dd_count=$((per_sz >> 20))
echo "create $NR_VILLAINS subvols with a file of size $per_sz bytes for a total of $total_sz bytes."
for i in $(seq $NR_VILLAINS)
do
btrfs subvol create $(_biggo_vol $i) &>/dev/null
dd if=/dev/zero of=$(_biggo_file $i) bs=1M count=$dd_count &>/dev/null
done
echo "done creating subvols."
}
_setup() {
[ -f .done ] && rm .done
findmnt -n $dev && exit 1
if [ -f .re-mkfs ]; then
mkfs.btrfs -f -m single -d single $dev >/dev/null || exit 2
else
echo "touch .re-mkfs to populate the test fs"
fi
mount -o noatime $dev $mnt || exit 3
[ -f .re-mkfs ] && _subvoled_biggos
_setup_cgs
}
_my_cleanup() {
echo "CLEANUP!"
_kill_cg $BAD_CG
_kill_cg $GOOD_CG
sleep 1
umount $mnt
}
_bad_exit() {
_err "Unexpected Exit! $?"
_stats
exit $?
}
trap _my_cleanup EXIT
trap _bad_exit INT TERM
_setup
# Use a lot of page cache reading the big file
_villain() {
local i=$1
echo $BASHPID > $BAD_CG/cgroup.procs
$DIR/big-read $(_biggo_file $i)
}
# Hit del_csum a lot by overwriting lots of small new files
_victim() {
echo $BASHPID > $GOOD_CG/cgroup.procs
i=0;
while (true)
do
local tmp=$mnt/tmp.$i
dd if=/dev/zero of=$tmp bs=4k count=2 >/dev/null 2>&1
i=$((i+1))
[ $i -eq $NR_LITTLE ] && i=0
done
}
_one_sync() {
echo "sync..."
before=$(date +%s)
sync
after=$(date +%s)
echo "sync done in $((after - before))s"
_stats
}
# sync in a loop
_sync() {
echo "start sync loop"
syncs=0
echo $BASHPID > $GOOD_CG/cgroup.procs
while true
do
[ -f .done ] && break
_one_sync
syncs=$((syncs + 1))
[ -f .done ] && break
sleep 10
done
if [ $syncs -eq 0 ]; then
echo "do at least one sync!"
_one_sync
fi
echo "sync loop done."
}
_sleep() {
local time=${1-60}
local now=$(date +%s)
local end=$((now + time))
while [ $now -lt $end ];
do
echo "SLEEP: $((end - now))s left. Sleep 10."
sleep 10
now=$(date +%s)
done
}
echo "start $NR_VILLAINS villains"
for i in $(seq $NR_VILLAINS)
do
_villain $i &
disown # get rid of annoying log on kill (done via cgroup anyway)
done
echo "start $NR_VICTIMS victims"
for i in $(seq $NR_VICTIMS)
do
_victim &
disown
done
_sync &
SYNC_PID=$!
_sleep $1
_elapsed
touch .done
wait $SYNC_PID
echo "OK"
exit 0
Without this patch, that reproducer:
- Ran for 6+ minutes instead of 60s
- Hung hundreds of threads in D state on the csum reader lock
- Got a commit stuck for 3 minutes
sync done in 388s
================
Wed Jul 9 09:52:31 PM UTC 2025
elapsed: 420
commits 2
cur_commit_ms 0
last_commit_ms 159446
max_commit_ms 159446
total_commit_ms 160058
some avg10=99.03 avg60=98.97 avg300=75.43 total=418033386
full avg10=82.79 avg60=80.52 avg300=59.45 total=324995274
419 hits state R, D comms big-read
btrfs_tree_read_lock_nested
btrfs_read_lock_root_node
btrfs_search_slot
btrfs_lookup_csum
btrfs_lookup_bio_sums
btrfs_submit_bbio
1 hits state D comms btrfs-transacti
btrfs_tree_lock_nested
btrfs_lock_root_node
btrfs_search_slot
btrfs_del_csums
__btrfs_run_delayed_refs
btrfs_run_delayed_refs
With the patch, the reproducer exits naturally, in 65s, completing a
pretty decent 4 commits, despite heavy memory pressure. Occasionally you
can still trigger a rather long commit (couple seconds) but never one
that is minutes long.
sync done in 3s
================
elapsed: 65
commits 4
cur_commit_ms 0
last_commit_ms 485
max_commit_ms 689
total_commit_ms 2453
some avg10=98.28 avg60=64.54 avg300=19.39 total=64849893
full avg10=74.43 avg60=48.50 avg300=14.53 total=48665168
some random rwalker samples showed the most common stack in reclaim,
rather than the csum tree:
145 hits state R comms bash, sleep, dd, shuf
shrink_folio_list
shrink_lruvec
shrink_node
do_try_to_free_pages
try_to_free_mem_cgroup_pages
reclaim_high
Link: https://lpc.events/event/18/contributions/1883/
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Boris Burkov <boris@bur.io>
Signed-off-by: David Sterba <dsterba@suse.com>
Inspired by recent changes to compression level parsing in
6db1df415d ("btrfs: accept and ignore compression level for lzo")
it turns out that we do not do any extra validation for compression
level input string, thus allowing things like "compress=lzo:invalid" to
be accepted without warnings.
Although we accept levels that are beyond the supported algorithm
ranges, accepting completely invalid level specification is not correct.
Fix the too loose checks for compression level, by doing proper error
handling of kstrtoint(), so that we will reject not only too large
values (beyond int range) but also completely wrong levels like
"lzo:invalid".
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Any conversion of offsets in the logical or the physical mapping space
of the pages is done by a shift and the target type should be pgoff_t
(type of struct page::index). Fix the locations where it's still
unsigned long.
Signed-off-by: David Sterba <dsterba@suse.com>
Refactor the btrfs_compress_set_level() function by replacing the
nested usage of min() and max() macro with clamp() to simplify the
code and improve readability.
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: George Hu <integral@archlinux.org>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Our message helpers accept NULL for the fs_info in the context that does
not provide and print the common header of the message. The use of pr_*
helpers is only for special reasons, like module loading, device
scanning or multi-line output (print-tree).
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
We're using 'status' for the blk_status_t variables, rename 'ret' so we can
use it for generic errors.
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
We don't need to have a separate variable to read the bio status, 'ret'
works for that just fine so remove 'error'.
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
This removes the last direct poke into bvec internals in btrfs.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Rename all the exported functions from extent_map.h that don't have a
'btrfs_' prefix in their names, so that they are consistent with all the
other functions, to make it clear they are btrfs specific functions and
to avoid potential name collisions in the future with functions defined
elsewhere in the kernel.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
These functions are exported and don't have a 'btrfs_' prefix in their
names, which goes against coding style conventions. Rename them to have
such prefix, making it clear they are from btrfs and avoiding potential
collisions in the future with functions defined elsewhere outside btrfs.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
These functions are exported and don't have a 'btrfs_' prefix in their
names, which goes against coding style conventions. Rename them to have
such prefix, making it clear they are from btrfs and avoiding potential
collisions in the future with functions defined elsewhere outside btrfs.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
These functions are exported and don't have a 'btrfs_' prefix in their
names, which goes against coding style conventions. Rename them to have
such prefix, making it clear they are from btrfs and avoiding potential
collisions in the future with functions defined elsewhere outside btrfs.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
These functions are exported so they should have a 'btrfs_' prefix by
convention, to make it clear they are btrfs specific and to avoid
collisions with functions from elsewhere in the kernel. So add a prefix to
their name.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Currently we use the following pattern to detect if the folio contains
the end of a file:
if (folio->index == end_index)
folio_zero_range();
But that only works if the folio is page sized.
For the following case, it will not work and leave the range beyond EOF
uninitialized:
The page size is 4K, and the fs block size is also 4K.
16K 20K 24K
| | | |
|
EOF at 22K
And we have a large folio sized 8K at file offset 16K.
In that case, the old "folio->index == end_index" will not work, thus
the range [22K, 24K) will not be zeroed out.
Fix the following call sites which use the above pattern:
- add_ra_bio_pages()
- extent_writepage()
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
[BUG WITH EXPERIMENTAL LARGE FOLIOS]
When testing the experimental large data folio support with compression,
there are several ASSERT()s triggered from btrfs_decompress_buf2page()
when running fsstress with compress=zstd mount option:
- ASSERT(copy_len) from btrfs_decompress_buf2page()
- VM_BUG_ON(offset + len > PAGE_SIZE) from memcpy_to_page()
[CAUSE]
Inside btrfs_decompress_buf2page(), we need to grab the file offset from
the current bvec.bv_page, to check if we even need to copy data into the
bio.
And since we're using single page bvec, and no large folio, every page
inside the folio should have its index properly setup.
But when large folios are involved, only the first page (aka, the head
page) of a large folio has its index properly initialized.
The other pages inside the large folio will not have their indexes
properly initialized.
Thus the page_offset() call inside btrfs_decompress_buf2page() will
result garbage, and completely screw up the @copy_len calculation.
[FIX]
Instead of using page->index directly, go with page_pgoff(), which can
handle non-head pages correctly.
So introduce a helper, file_offset_from_bvec(), to get the file offset
from a single page bio_vec, so the copy_len calculation can be done
correctly.
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: Sweet Tea Dorminy <sweettea-kernel@dorminy.me>
Reviewed-by: Boris Burkov <boris@bur.io>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
In preparation for making the kmalloc() family of allocators type aware,
we need to make sure that the returned type from the allocation matches
the type of the variable being assigned. (Before, the allocator would
always return "void *", which can be implicitly cast to any pointer type.)
The assigned type is "struct folio **" but the returned type will be
"struct page **". These are the same allocation size (pointer size), but
the types don't match. Adjust the allocation type to match the assignment.
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Kees Cook <kees@kernel.org>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The zstd and zlib compression types support setting compression levels.
Enhance the defrag interface to specify the levels as well. For zstd the
negative (realtime) levels are also accepted.
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Daniel Vacek <neelx@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
When fgp_flags and gfp_flags are zero, use filemap_get_folio(A, B)
instead of __filemap_get_folio(A, B, 0, 0)—no need for the extra
arguments 0, 0.
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Since there is no user of reader locks, rename the writer locks into a
more generic name, by removing the "_writer" part from the name.
And also rename btrfs_subpage::writer into btrfs_subpage::locked.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Since commit d7172f52e9 ("btrfs: use per-buffer locking for
extent_buffer reading"), metadata read no longer relies on the subpage
reader locking.
This means we do not need to maintain a different metadata/data split
for locking, so we can convert the existing reader lock users by:
- add_ra_bio_pages()
Convert to btrfs_folio_set_writer_lock()
- end_folio_read()
Convert to btrfs_folio_end_writer_lock()
- begin_folio_read()
Convert to btrfs_folio_set_writer_lock()
- folio_range_has_eb()
Remove the subpage->readers checks, since it is always 0.
- Remove btrfs_subpage_start_reader() and btrfs_subpage_end_reader()
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The compression heuristic pass does not need a level, so we can drop the
parameter.
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The LZO compression has only one level, we don't need to pass the
parameter.
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
There are already two bugs (one in zlib, one in zstd) that involved
compression path is not handling sector size < page size cases well.
So it makes more sense to make sure that btrfs_compress_folios() returns
Since we already have two bugs (one in zlib, one in zstd) in the
compression path resulting the @total_in be to larger than the
to-be-compressed range length, there is enough reason to add an ASSERT()
to make sure the total read-in length doesn't exceed the input length.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The old page API is being gradually replaced and converted to use folio
to improve code readability and avoid repeated conversion between page
and folio. Based on the previous patch, the compression path can be
directly used in folio without converting to page.
Signed-off-by: Li Zetao <lizetao1@huawei.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The old page API is being gradually replaced and converted to use folio
to improve code readability and avoid repeated conversion between page
and folio. And memcpy_to_page() can be replaced with memcpy_to_folio().
But there is no memzero_folio(), but it can be replaced equivalently by
folio_zero_range().
Signed-off-by: Li Zetao <lizetao1@huawei.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The old page API is being gradually replaced and converted to use folio
to improve code readability and avoid repeated conversion between page
and folio. And memcpy_to_page() can be replaced with memcpy_to_folio().
But there is no memzero_folio(), but it can be replaced equivalently by
folio_zero_range().
Signed-off-by: Li Zetao <lizetao1@huawei.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The old page API is being gradually replaced and converted to use folio
to improve code readability and avoid repeated conversion between page
and folio. And memcpy_to_page() can be replaced with memcpy_to_folio().
But there is no memzero_folio(), but it can be replaced equivalently by
folio_zero_range().
Signed-off-by: Li Zetao <lizetao1@huawei.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Historically we've held the extent lock throughout the entire read.
There's been a few reasons for this, but it's mostly just caused us
problems. For example, this prevents us from allowing page faults
during direct io reads, because we could deadlock. This has forced us
to only allow 4k reads at a time for io_uring NOWAIT requests because we
have no idea if we'll be forced to page fault and thus have to do a
whole lot of work.
On the buffered side we are protected by the page lock, as long as we're
reading things like buffered writes, punch hole, and even direct IO to a
certain degree will get hung up on the page lock while the page is in
flight.
On the direct side we have the dio extent lock, which acts much like the
way the extent lock worked previously to this patch, however just for
direct reads. This protects direct reads from concurrent direct writes,
while we're protected from buffered writes via the inode lock.
Now that we're protected in all cases, narrow the extent lock to the
part where we're getting the extent map to submit the reads, no longer
holding the extent lock for the entire read operation. Push the extent
lock down into do_readpage() so that we're only grabbing it when looking
up the extent map. This portion was contributed by Goldwyn.
Co-developed-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
Reviewed-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The function name is a bit misleading as it submits the btrfs_bio
(bbio), rename it so we can use btrfs_submit_bio() when an actual bio is
submitted.
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Willy is going to get rid of page->index, and add_ra_bio_pages uses
page->index. Make his life easier by converting add_ra_bio_pages to use
folios so that we are no longer using page->index.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The function btrfs_alloc_folio_array() is only utilized in
btrfs_submit_compressed_read() and no other location, and the only
caller is not utilizing the @extra_gfp parameter.
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>