linux

Commit Graph

Author	SHA1	Message	Date
Jens Axboe	d47c670061	xfs: flag as supporting FOP_DONTCACHE Read side was already fully supported, and with the write side appropriately punted to the worker queue, all that's needed now is setting FOP_DONTCACHE in the file_operations structure to enable full support for read and write uncached IO. This provides similar benefits to using RWF_DONTCACHE with reads. Testing buffered writes on 32 files: writing bs 65536, uncached 0 1s: 196035MB/sec 2s: 132308MB/sec 3s: 132438MB/sec 4s: 116528MB/sec 5s: 103898MB/sec 6s: 108893MB/sec 7s: 99678MB/sec 8s: 106545MB/sec 9s: 106826MB/sec 10s: 101544MB/sec 11s: 111044MB/sec 12s: 124257MB/sec 13s: 116031MB/sec 14s: 114540MB/sec 15s: 115011MB/sec 16s: 115260MB/sec 17s: 116068MB/sec 18s: 116096MB/sec where it's quite obvious where the page cache filled, and performance dropped from to about half of where it started, settling in at around 115GB/sec. Meanwhile, 32 kswapds were running full steam trying to reclaim pages. Running the same test with uncached buffered writes: writing bs 65536, uncached 1 1s: 198974MB/sec 2s: 189618MB/sec 3s: 193601MB/sec 4s: 188582MB/sec 5s: 193487MB/sec 6s: 188341MB/sec 7s: 194325MB/sec 8s: 188114MB/sec 9s: 192740MB/sec 10s: 189206MB/sec 11s: 193442MB/sec 12s: 189659MB/sec 13s: 191732MB/sec 14s: 190701MB/sec 15s: 191789MB/sec 16s: 191259MB/sec 17s: 190613MB/sec 18s: 191951MB/sec and the behavior is fully predictable, performing the same throughout even after the page cache would otherwise have fully filled with dirty data. It's also about 65% faster, and using half the CPU of the system compared to the normal buffered write. Signed-off-by: Jens Axboe <axboe@kernel.dk> Link: https://lore.kernel.org/r/20250204184047.356762-3-axboe@kernel.dk Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: "Darrick J. Wong" <djwong@kernel.org> Signed-off-by: Christian Brauner <brauner@kernel.org>	2025-02-27 11:21:00 +01:00
Christoph Hellwig	9b47d37496	xfs: remove the XBF_STALE check from xfs_buf_rele_cached xfs_buf_stale already set b_lru_ref to 0, and thus prevents the buffer from moving to the LRU. Remove the duplicate check. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Carlos Maiolino <cem@kernel.org>	2025-02-25 13:05:59 +01:00
Christoph Hellwig	0d1120b9bb	xfs: remove most in-flight buffer accounting The buffer cache keeps a bt_io_count per-CPU counter to track all in-flight I/O, which is used to ensure no I/O is in flight when unmounting the file system. For most I/O we already keep track of inflight I/O at higher levels: - for synchronous I/O (xfs_buf_read/xfs_bwrite/xfs_buf_delwri_submit), the caller has a reference and waits for I/O completions using xfs_buf_iowait - for xfs_buf_delwri_submit_nowait the only caller (AIL writeback) tracks the log items that the buffer attached to This only leaves only xfs_buf_readahead_map as a submitter of asynchronous I/O that is not tracked by anything else. Replace the bt_io_count per-cpu counter with a more specific bt_readahead_count counter only tracking readahead I/O. This allows to simply increment it when submitting readahead I/O and decrementing it when it completed, and thus simplify xfs_buf_rele and remove the needed for the XBF_NO_IOACCT flags and the XFS_BSTATE_IN_FLIGHT buffer state. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Carlos Maiolino <cem@kernel.org>	2025-02-25 13:05:59 +01:00
Christoph Hellwig	efc5f7a9f3	xfs: decouple buffer readahead from the normal buffer read path xfs_buf_readahead_map is the only caller of xfs_buf_read_map and thus _xfs_buf_read that is not synchronous. Split it from xfs_buf_read_map so that the asynchronous path is self-contained and the now purely synchronous xfs_buf_read_map / _xfs_buf_read implementation can be simplified. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Carlos Maiolino <cem@kernel.org>	2025-02-25 13:05:59 +01:00
Christoph Hellwig	4b90de5bc0	xfs: reduce context switches for synchronous buffered I/O Currently all metadata I/O completions happen in the m_buf_workqueue workqueue. But for synchronous I/O (i.e. all buffer reads) there is no need for that, as there always is a called in process context that is waiting for the I/O. Factor out the guts of xfs_buf_ioend into a separate helper and call it from xfs_buf_iowait to avoid a double an extra context switch to the workqueue. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Carlos Maiolino <cem@kernel.org>	2025-02-25 13:05:59 +01:00
Christoph Hellwig	2d873efd17	xfs: flush inodegc before swapon Fix the brand new xfstest that tries to swapon on a recently unshared file and use the chance to document the other bit of magic in this function. The big comment is taken from a mailinglist post by Dave Chinner. Fixes: `5e672cd69f` ("xfs: introduce xfs_inodegc_push()") Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Carlos Maiolino <cem@kernel.org>	2025-02-14 09:40:35 +01:00
Christoph Hellwig	3cd6a8056f	xfs: rename xfs_iomap_swapfile_activate to xfs_vm_swap_activate Match the method name and the naming convention or address_space operations. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Carlos Maiolino <cem@kernel.org>	2025-02-14 09:40:35 +01:00
Carlos Maiolino	9f0902091c	xfs: Do not allow norecovery mount with quotacheck Mounting a filesystem that requires quota state changing will generate a transaction. We already check for a read-only device; we should do that for norecovery too. A quotacheck on a norecovery mount, and with the right log size, will cause the mount process to hang on: [<0>] xlog_grant_head_wait+0x5d/0x2a0 [xfs] [<0>] xlog_grant_head_check+0x112/0x180 [xfs] [<0>] xfs_log_reserve+0xe3/0x260 [xfs] [<0>] xfs_trans_reserve+0x179/0x250 [xfs] [<0>] xfs_trans_alloc+0x101/0x260 [xfs] [<0>] xfs_sync_sb+0x3f/0x80 [xfs] [<0>] xfs_qm_mount_quotas+0xe3/0x2f0 [xfs] [<0>] xfs_mountfs+0x7ad/0xc20 [xfs] [<0>] xfs_fs_fill_super+0x762/0xa50 [xfs] [<0>] get_tree_bdev_flags+0x131/0x1d0 [<0>] vfs_get_tree+0x26/0xd0 [<0>] vfs_cmd_create+0x59/0xe0 [<0>] __do_sys_fsconfig+0x4e3/0x6b0 [<0>] do_syscall_64+0x82/0x160 [<0>] entry_SYSCALL_64_after_hwframe+0x76/0x7e This is caused by a transaction running with bogus initialized head/tail I initially hit this while running generic/050, with random log sizes, but I managed to reproduce it reliably here with the steps below: mkfs.xfs -f -lsize=1025M -f -b size=4096 -m crc=1,reflink=1,rmapbt=1, -i sparse=1 /dev/vdb2 > /dev/null mount -o usrquota,grpquota,prjquota /dev/vdb2 /mnt xfs_io -x -c 'shutdown -f' /mnt umount /mnt mount -o ro,norecovery,usrquota,grpquota,prjquota /dev/vdb2 /mnt Last mount hangs up As we add yet another validation if quota state is changing, this also add a new helper named xfs_qm_validate_state_change(), factoring the quota state changes out of xfs_qm_newmount() to reduce cluttering within it. Signed-off-by: Carlos Maiolino <cmaiolino@redhat.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Carlos Maiolino <cem@kernel.org>	2025-02-14 09:40:35 +01:00
Lukas Herbolt	9e00163c31	xfs: do not check NEEDSREPAIR if ro,norecovery mount. If there is corrutpion on the filesystem andxfs_repair fails to repair it. The last resort of getting the data is to use norecovery,ro mount. But if the NEEDSREPAIR is set the filesystem cannot be mounted. The flag must be cleared out manually using xfs_db, to get access to what left over of the corrupted fs. Signed-off-by: Lukas Herbolt <lukas@herbolt.com> Reviewed-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Eric Sandeen <sandeen@redhat.com> Signed-off-by: Carlos Maiolino <cem@kernel.org>	2025-02-14 09:40:35 +01:00
Darrick J. Wong	6e33017c32	xfs: fix data fork format filtering during inode repair Coverity noticed that xrep_dinode_bad_metabt_fork never runs because XFS_DINODE_FMT_META_BTREE is always filtered out in the mode selection switch of xrep_dinode_check_dfork. Metadata btrees are allowed only in the data forks of regular files, so add this case explicitly. I guess this got fubard during a refactoring prior to 6.13 and I didn't notice until now. :/ Coverity-id: 1617714 Signed-off-by: "Darrick J. Wong" <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Carlos Maiolino <cem@kernel.org>	2025-02-14 09:40:24 +01:00
Darrick J. Wong	66314e9a57	xfs: fix online repair probing when CONFIG_XFS_ONLINE_REPAIR=n I received a report from the release engineering side of the house that xfs_scrub without the -n flag (aka fix it mode) would try to fix a broken filesystem even on a kernel that doesn't have online repair built into it: # xfs_scrub -dTvn /mnt/test EXPERIMENTAL xfs_scrub program in use! Use at your own risk! Phase 1: Find filesystem geometry. /mnt/test: using 1 threads to scrub. Phase 1: Memory used: 132k/0k (108k/25k), time: 0.00/ 0.00/ 0.00s <snip> Phase 4: Repair filesystem. <snip> Info: /mnt/test/some/victimdir directory entries: Attempting repair. (repair.c line 351) Corruption: /mnt/test/some/victimdir directory entries: Repair unsuccessful; offline repair required. (repair.c line 204) Source: https://blogs.oracle.com/linux/post/xfs-online-filesystem-repair It is strange that xfs_scrub doesn't refuse to run, because the kernel is supposed to return EOPNOTSUPP if we actually needed to run a repair, and xfs_io's repair subcommand will perror that. And yet: # xfs_io -x -c 'repair probe' /mnt/test # The first problem is commit `dcb660f922` (4.15) which should have had xchk_probe set the CORRUPT OFLAG so that any of the repair machinery will get called at all. It turns out that some refactoring that happened in the 6.6-6.8 era broke the operation of this corner case. What we really want to happen is that all the predicates that would steer xfs_scrub_metadata() towards calling xrep_attempt() should function the same way that they do when repair is compiled in; and then xrep_attempt gets to return the fatal EOPNOTSUPP error code that causes the probe to fail. Instead, commit `8336a64eb7` (6.6) started the failwhale swimming by hoisting OFLAG checking logic into a helper whose non-repair stub always returns false, causing scrub to return "repair not needed" when in fact the repair is not supported. Prior to that commit, the oflag checking that was open-coded in scrub.c worked correctly. Similarly, in commit `4bdfd7d157` (6.8) we hoisted the IFLAG_REPAIR and ALREADY_FIXED logic into a helper whose non-repair stub always returns false, so we never enter the if test body that would have called xrep_attempt, let alone fail to decode the OFLAGs correctly. The final insult (yes, we're doing The Naked Gun now) is commit `48a72f6086` (6.8) in which we hoisted the "are we going to try a repair?" predicate into yet another function with a non-repair stub always returns false. Fix xchk_probe to trigger xrep_probe if repair is enabled, or return EOPNOTSUPP directly if it is not. For all the other scrub types, we need to fix the header predicates so that the ->repair functions (which are all xrep_notsupported) get called to return EOPNOTSUPP. Commit 48a72 is tagged here because the scrub code prior to LTS 6.12 are incomplete and not worth patching. Reported-by: David Flynn <david.flynn@oracle.com> Cc: <stable@vger.kernel.org> # v6.8 Fixes: `8336a64eb7` ("xfs: don't complain about unfixed metadata when repairs were injected") Signed-off-by: "Darrick J. Wong" <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Carlos Maiolino <cem@kernel.org>	2025-02-14 09:37:25 +01:00
Christoph Hellwig	ddd402bbbf	iomap: pass private data to iomap_truncate_page Allow the file system to pass private data which can be used by the iomap_begin and iomap_end methods through the private pointer in the iomap_iter structure. Signed-off-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20250206064035.2323428-12-hch@lst.de Reviewed-by: "Darrick J. Wong" <djwong@kernel.org> Signed-off-by: Christian Brauner <brauner@kernel.org>	2025-02-06 13:02:15 +01:00
Christoph Hellwig	c6d1b8d154	iomap: pass private data to iomap_zero_range Allow the file system to pass private data which can be used by the iomap_begin and iomap_end methods through the private pointer in the iomap_iter structure. Signed-off-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20250206064035.2323428-11-hch@lst.de Reviewed-by: "Darrick J. Wong" <djwong@kernel.org> Signed-off-by: Christian Brauner <brauner@kernel.org>	2025-02-06 13:02:15 +01:00
Christoph Hellwig	02b39c4655	iomap: pass private data to iomap_page_mkwrite Allow the file system to pass private data which can be used by the iomap_begin and iomap_end methods through the private pointer in the iomap_iter structure. Signed-off-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20250206064035.2323428-10-hch@lst.de Reviewed-by: "Darrick J. Wong" <djwong@kernel.org> Signed-off-by: Christian Brauner <brauner@kernel.org>	2025-02-06 13:02:15 +01:00
Christoph Hellwig	7102733306	iomap: simplify io_flags and io_type in struct iomap_ioend The ioend fields for distinct types of I/O are a bit complicated. Consolidate them into a single io_flag field with it's own flags decoupled from the iomap flags. This also prepares for adding a new flag that is unrelated to both of the iomap namespaces. Signed-off-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20250206064035.2323428-3-hch@lst.de Reviewed-by: "Darrick J. Wong" <djwong@kernel.org> Signed-off-by: Christian Brauner <brauner@kernel.org>	2025-02-06 13:02:13 +01:00
Christoph Hellwig	c50105933f	iomap: allow the file system to submit the writeback bios Change ->prepare_ioend to ->submit_ioend and require file systems that implement it to submit the bio. This is needed for file systems that do their own work on the bios before submitting them to the block layer like btrfs or zoned xfs. To make this easier also pass the writeback context to the method. Signed-off-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20250206064035.2323428-2-hch@lst.de Reviewed-by: "Darrick J. Wong" <djwong@kernel.org> Signed-off-by: Christian Brauner <brauner@kernel.org>	2025-02-06 13:02:13 +01:00
Linus Torvalds	0a08238acf	xfs bug fixes 6.14-rc2 Signed-off-by: Carlos Maiolino <cem@kernel.org> -----BEGIN PGP SIGNATURE----- iJUEABMJAB0WIQSmtYVZ/MfVMGUq1GNcsMJ8RxYuYwUCZ6C1iQAKCRBcsMJ8RxYu Y1k+AX9qlNEvvzfcJRP7wcqjX64B0IIpuBNd86flWIYcRUug9d/vIExKvRF2A+t7 x1kc5bsBgONUBNY+xDTMAXU58MnONDIWoy9oeP/WXGaXWelN5TKlVDhHU30gXNT1 CoDNOQ/B1w== =pduU -----END PGP SIGNATURE----- Merge tag 'xfs-fixes-6.14-rc2' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux Pull xfs bug fixes from Carlos Maiolino: "A few fixes for XFS, but the most notable one is: - xfs: remove xfs_buf_cache.bc_lock which has been hit by different persons including syzbot" * tag 'xfs-fixes-6.14-rc2' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux: xfs: remove xfs_buf_cache.bc_lock xfs: Add error handling for xfs_reflink_cancel_cow_range xfs: Propagate errors from xfs_reflink_cancel_cow_range in xfs_dax_write_iomap_end xfs: don't call remap_verify_area with sb write protection held xfs: remove an out of data comment in _xfs_buf_alloc xfs: fix the entry condition of exact EOF block allocation optimization	2025-02-03 08:51:24 -08:00
Joel Granados	1751f872cc	treewide: const qualify ctl_tables where applicable Add the const qualifier to all the ctl_tables in the tree except for watchdog_hardlockup_sysctl, memory_allocation_profiling_sysctls, loadpin_sysctl_table and the ones calling register_net_sysctl (./net, drivers/inifiniband dirs). These are special cases as they use a registration function with a non-const qualified ctl_table argument or modify the arrays before passing them on to the registration function. Constifying ctl_table structs will prevent the modification of proc_handler function pointers as the arrays would reside in .rodata. This is made possible after commit `78eb4ea25c` ("sysctl: treewide: constify the ctl_table argument of proc_handlers") constified all the proc_handlers. Created this by running an spatch followed by a sed command: Spatch: virtual patch @ depends on !(file in "net") disable optional_qualifier @ identifier table_name != { watchdog_hardlockup_sysctl, iwcm_ctl_table, ucma_ctl_table, memory_allocation_profiling_sysctls, loadpin_sysctl_table }; @@ + const struct ctl_table table_name [] = { ... }; sed: sed --in-place \ -e "s/struct ctl_table .table = &uts_kern/const struct ctl_table *table = \&uts_kern/" \ kernel/utsname_sysctl.c Reviewed-by: Song Liu <song@kernel.org> Acked-by: Steven Rostedt (Google) <rostedt@goodmis.org> # for kernel/trace/ Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> # SCSI Reviewed-by: Darrick J. Wong <djwong@kernel.org> # xfs Acked-by: Jani Nikula <jani.nikula@intel.com> Acked-by: Corey Minyard <cminyard@mvista.com> Acked-by: Wei Liu <wei.liu@kernel.org> Acked-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Bill O'Donnell <bodonnel@redhat.com> Acked-by: Baoquan He <bhe@redhat.com> Acked-by: Ashutosh Dixit <ashutosh.dixit@intel.com> Acked-by: Anna Schumaker <anna.schumaker@oracle.com> Signed-off-by: Joel Granados <joel.granados@kernel.org>	2025-01-28 13:48:37 +01:00
Christoph Hellwig	a9ab28b3d2	xfs: remove xfs_buf_cache.bc_lock xfs_buf_cache.bc_lock serializes adding buffers to and removing them from the hashtable. But as the rhashtable code already uses fine grained internal locking for inserts and removals the extra protection isn't actually required. It also happens to fix a lock order inversion vs b_lock added by the recent lookup race fix. Fixes: `ee10f6fcdb` ("xfs: fix buffer lookup vs release race") Reported-by: Lai, Yi <yi1.lai@linux.intel.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Carlos Maiolino <cem@kernel.org>	2025-01-28 11:18:22 +01:00
Wentao Liang	26b63bee2f	xfs: Add error handling for xfs_reflink_cancel_cow_range In xfs_inactive(), xfs_reflink_cancel_cow_range() is called without error handling, risking unnoticed failures and inconsistent behavior compared to other parts of the code. Fix this issue by adding an error handling for the xfs_reflink_cancel_cow_range(), improving code robustness. Fixes: `6231848c3a` ("xfs: check for cow blocks before trying to clear them") Cc: stable@vger.kernel.org # v4.17 Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Wentao Liang <vulab@iscas.ac.cn> Signed-off-by: Carlos Maiolino <cem@kernel.org>	2025-01-27 11:42:48 +01:00
Wentao Liang	fb95897b8c	xfs: Propagate errors from xfs_reflink_cancel_cow_range in xfs_dax_write_iomap_end In xfs_dax_write_iomap_end(), directly return the result of xfs_reflink_cancel_cow_range() when !written, ensuring proper error propagation and improving code robustness. Fixes: `ea6c49b784` ("xfs: support CoW in fsdax mode") Cc: stable@vger.kernel.org # v6.0 Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Wentao Liang <vulab@iscas.ac.cn> Signed-off-by: Carlos Maiolino <cem@kernel.org>	2025-01-27 11:40:57 +01:00
Linus Torvalds	9c5968db9e	The various patchsets are summarized below. Plus of course many indivudual patches which are described in their changelogs. - "Allocate and free frozen pages" from Matthew Wilcox reorganizes the page allocator so we end up with the ability to allocate and free zero-refcount pages. So that callers (ie, slab) can avoid a refcount inc & dec. - "Support large folios for tmpfs" from Baolin Wang teaches tmpfs to use large folios other than PMD-sized ones. - "Fix mm/rodata_test" from Petr Tesarik performs some maintenance and fixes for this small built-in kernel selftest. - "mas_anode_descend() related cleanup" from Wei Yang tidies up part of the mapletree code. - "mm: fix format issues and param types" from Keren Sun implements a few minor code cleanups. - "simplify split calculation" from Wei Yang provides a few fixes and a test for the mapletree code. - "mm/vma: make more mmap logic userland testable" from Lorenzo Stoakes continues the work of moving vma-related code into the (relatively) new mm/vma.c. - "mm/page_alloc: gfp flags cleanups for alloc_contig_()" from David Hildenbrand cleans up and rationalizes handling of gfp flags in the page allocator. - "readahead: Reintroduce fix for improper RA window sizing" from Jan Kara is a second attempt at fixing a readahead window sizing issue. It should reduce the amount of unnecessary reading. - "synchronously scan and reclaim empty user PTE pages" from Qi Zheng addresses an issue where "huge" amounts of pte pagetables are accumulated (https://lore.kernel.org/lkml/cover.1718267194.git.zhengqi.arch@bytedance.com/). Qi's series addresses this windup by synchronously freeing PTE memory within the context of madvise(MADV_DONTNEED). - "selftest/mm: Remove warnings found by adding compiler flags" from Muhammad Usama Anjum fixes some build warnings in the selftests code when optional compiler warnings are enabled. - "mm: don't use __GFP_HARDWALL when migrating remote pages" from David Hildenbrand tightens the allocator's observance of __GFP_HARDWALL. - "pkeys kselftests improvements" from Kevin Brodsky implements various fixes and cleanups in the MM selftests code, mainly pertaining to the pkeys tests. - "mm/damon: add sample modules" from SeongJae Park enhances DAMON to estimate application working set size. - "memcg/hugetlb: Rework memcg hugetlb charging" from Joshua Hahn provides some cleanups to memcg's hugetlb charging logic. - "mm/swap_cgroup: remove global swap cgroup lock" from Kairui Song removes the global swap cgroup lock. A speedup of 10% for a tmpfs-based kernel build was demonstrated. - "zram: split page type read/write handling" from Sergey Senozhatsky has several fixes and cleaups for zram in the area of zram_write_page(). A watchdog softlockup warning was eliminated. - "move pagetable__dtor() to __tlb_remove_table()" from Kevin Brodsky cleans up the pagetable destructor implementations. A rare use-after-free race is fixed. - "mm/debug: introduce and use VM_WARN_ON_VMG()" from Lorenzo Stoakes simplifies and cleans up the debugging code in the VMA merging logic. - "Account page tables at all levels" from Kevin Brodsky cleans up and regularizes the pagetable ctor/dtor handling. This results in improvements in accounting accuracy. - "mm/damon: replace most damon_callback usages in sysfs with new core functions" from SeongJae Park cleans up and generalizes DAMON's sysfs file interface logic. - "mm/damon: enable page level properties based monitoring" from SeongJae Park increases the amount of information which is presented in response to DAMOS actions. - "mm/damon: remove DAMON debugfs interface" from SeongJae Park removes DAMON's long-deprecated debugfs interfaces. Thus the migration to sysfs is completed. - "mm/hugetlb: Refactor hugetlb allocation resv accounting" from Peter Xu cleans up and generalizes the hugetlb reservation accounting. - "mm: alloc_pages_bulk: small API refactor" from Luiz Capitulino removes a never-used feature of the alloc_pages_bulk() interface. - "mm/damon: extend DAMOS filters for inclusion" from SeongJae Park extends DAMOS filters to support not only exclusion (rejecting), but also inclusion (allowing) behavior. - "Add zpdesc memory descriptor for zswap.zpool" from Alex Shi "introduces a new memory descriptor for zswap.zpool that currently overlaps with struct page for now. This is part of the effort to reduce the size of struct page and to enable dynamic allocation of memory descriptors." - "mm, swap: rework of swap allocator locks" from Kairui Song redoes and simplifies the swap allocator locking. A speedup of 400% was demonstrated for one workload. As was a 35% reduction for kernel build time with swap-on-zram. - "mm: update mips to use do_mmap(), make mmap_region() internal" from Lorenzo Stoakes reworks MIPS's use of mmap_region() so that mmap_region() can be made MM-internal. - "mm/mglru: performance optimizations" from Yu Zhao fixes a few MGLRU regressions and otherwise improves MGLRU performance. - "Docs/mm/damon: add tuning guide and misc updates" from SeongJae Park updates DAMON documentation. - "Cleanup for memfd_create()" from Isaac Manjarres does that thing. - "mm: hugetlb+THP folio and migration cleanups" from David Hildenbrand provides various cleanups in the areas of hugetlb folios, THP folios and migration. - "Uncached buffered IO" from Jens Axboe implements the new RWF_DONTCACHE flag which provides synchronous dropbehind for pagecache reading and writing. To permite userspace to address issues with massive buildup of useless pagecache when reading/writing fast devices. - "selftests/mm: virtual_address_range: Reduce memory" from Thomas Weißschuh fixes and optimizes some of the MM selftests. -----BEGIN PGP SIGNATURE----- iHUEABYKAB0WIQTTMBEPP41GrTpTJgfdBJ7gKXxAjgUCZ5a+cwAKCRDdBJ7gKXxA jtoyAP9R58oaOKPJuTizEKKXvh/RpMyD6sYcz/uPpnf+cKTZxQEAqfVznfWlw/Lz uC3KRZYhmd5YrxU4o+qjbzp9XWX/xAE= =Ib2s -----END PGP SIGNATURE----- Merge tag 'mm-stable-2025-01-26-14-59' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm Pull MM updates from Andrew Morton: "The various patchsets are summarized below. Plus of course many indivudual patches which are described in their changelogs. - "Allocate and free frozen pages" from Matthew Wilcox reorganizes the page allocator so we end up with the ability to allocate and free zero-refcount pages. So that callers (ie, slab) can avoid a refcount inc & dec - "Support large folios for tmpfs" from Baolin Wang teaches tmpfs to use large folios other than PMD-sized ones - "Fix mm/rodata_test" from Petr Tesarik performs some maintenance and fixes for this small built-in kernel selftest - "mas_anode_descend() related cleanup" from Wei Yang tidies up part of the mapletree code - "mm: fix format issues and param types" from Keren Sun implements a few minor code cleanups - "simplify split calculation" from Wei Yang provides a few fixes and a test for the mapletree code - "mm/vma: make more mmap logic userland testable" from Lorenzo Stoakes continues the work of moving vma-related code into the (relatively) new mm/vma.c - "mm/page_alloc: gfp flags cleanups for alloc_contig_()" from David Hildenbrand cleans up and rationalizes handling of gfp flags in the page allocator - "readahead: Reintroduce fix for improper RA window sizing" from Jan Kara is a second attempt at fixing a readahead window sizing issue. It should reduce the amount of unnecessary reading - "synchronously scan and reclaim empty user PTE pages" from Qi Zheng addresses an issue where "huge" amounts of pte pagetables are accumulated: https://lore.kernel.org/lkml/cover.1718267194.git.zhengqi.arch@bytedance.com/ Qi's series addresses this windup by synchronously freeing PTE memory within the context of madvise(MADV_DONTNEED) - "selftest/mm: Remove warnings found by adding compiler flags" from Muhammad Usama Anjum fixes some build warnings in the selftests code when optional compiler warnings are enabled - "mm: don't use __GFP_HARDWALL when migrating remote pages" from David Hildenbrand tightens the allocator's observance of __GFP_HARDWALL - "pkeys kselftests improvements" from Kevin Brodsky implements various fixes and cleanups in the MM selftests code, mainly pertaining to the pkeys tests - "mm/damon: add sample modules" from SeongJae Park enhances DAMON to estimate application working set size - "memcg/hugetlb: Rework memcg hugetlb charging" from Joshua Hahn provides some cleanups to memcg's hugetlb charging logic - "mm/swap_cgroup: remove global swap cgroup lock" from Kairui Song removes the global swap cgroup lock. A speedup of 10% for a tmpfs-based kernel build was demonstrated - "zram: split page type read/write handling" from Sergey Senozhatsky has several fixes and cleaups for zram in the area of zram_write_page(). A watchdog softlockup warning was eliminated - "move pagetable__dtor() to __tlb_remove_table()" from Kevin Brodsky cleans up the pagetable destructor implementations. A rare use-after-free race is fixed - "mm/debug: introduce and use VM_WARN_ON_VMG()" from Lorenzo Stoakes simplifies and cleans up the debugging code in the VMA merging logic - "Account page tables at all levels" from Kevin Brodsky cleans up and regularizes the pagetable ctor/dtor handling. This results in improvements in accounting accuracy - "mm/damon: replace most damon_callback usages in sysfs with new core functions" from SeongJae Park cleans up and generalizes DAMON's sysfs file interface logic - "mm/damon: enable page level properties based monitoring" from SeongJae Park increases the amount of information which is presented in response to DAMOS actions - "mm/damon: remove DAMON debugfs interface" from SeongJae Park removes DAMON's long-deprecated debugfs interfaces. Thus the migration to sysfs is completed - "mm/hugetlb: Refactor hugetlb allocation resv accounting" from Peter Xu cleans up and generalizes the hugetlb reservation accounting - "mm: alloc_pages_bulk: small API refactor" from Luiz Capitulino removes a never-used feature of the alloc_pages_bulk() interface - "mm/damon: extend DAMOS filters for inclusion" from SeongJae Park extends DAMOS filters to support not only exclusion (rejecting), but also inclusion (allowing) behavior - "Add zpdesc memory descriptor for zswap.zpool" from Alex Shi introduces a new memory descriptor for zswap.zpool that currently overlaps with struct page for now. This is part of the effort to reduce the size of struct page and to enable dynamic allocation of memory descriptors - "mm, swap: rework of swap allocator locks" from Kairui Song redoes and simplifies the swap allocator locking. A speedup of 400% was demonstrated for one workload. As was a 35% reduction for kernel build time with swap-on-zram - "mm: update mips to use do_mmap(), make mmap_region() internal" from Lorenzo Stoakes reworks MIPS's use of mmap_region() so that mmap_region() can be made MM-internal - "mm/mglru: performance optimizations" from Yu Zhao fixes a few MGLRU regressions and otherwise improves MGLRU performance - "Docs/mm/damon: add tuning guide and misc updates" from SeongJae Park updates DAMON documentation - "Cleanup for memfd_create()" from Isaac Manjarres does that thing - "mm: hugetlb+THP folio and migration cleanups" from David Hildenbrand provides various cleanups in the areas of hugetlb folios, THP folios and migration - "Uncached buffered IO" from Jens Axboe implements the new RWF_DONTCACHE flag which provides synchronous dropbehind for pagecache reading and writing. To permite userspace to address issues with massive buildup of useless pagecache when reading/writing fast devices - "selftests/mm: virtual_address_range: Reduce memory" from Thomas Weißschuh fixes and optimizes some of the MM selftests" * tag 'mm-stable-2025-01-26-14-59' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (321 commits) mm/compaction: fix UBSAN shift-out-of-bounds warning s390/mm: add missing ctor/dtor on page table upgrade kasan: sw_tags: use str_on_off() helper in kasan_init_sw_tags() tools: add VM_WARN_ON_VMG definition mm/damon/core: use str_high_low() helper in damos_wmark_wait_us() seqlock: add missing parameter documentation for raw_seqcount_try_begin() mm/page-writeback: consolidate wb_thresh bumping logic into __wb_calc_thresh mm/page_alloc: remove the incorrect and misleading comment zram: remove zcomp_stream_put() from write_incompressible_page() mm: separate move/undo parts from migrate_pages_batch() mm/kfence: use str_write_read() helper in get_access_type() selftests/mm/mkdirty: fix memory leak in test_uffdio_copy() kasan: hw_tags: Use str_on_off() helper in kasan_init_hw_tags() selftests/mm: virtual_address_range: avoid reading from VM_IO mappings selftests/mm: vm_util: split up /proc/self/smaps parsing selftests/mm: virtual_address_range: unmap chunks after validation selftests/mm: virtual_address_range: mmap() without PROT_WRITE selftests/memfd/memfd_test: fix possible NULL pointer dereference mm: add FGP_DONTCACHE folio creation flag mm: call filemap_fdatawrite_range_kick() after IOCB_DONTCACHE issue ...	2025-01-26 18:36:23 -08:00
Luiz Capitulino	6bf9b5b40a	mm: alloc_pages_bulk: rename API The previous commit removed the page_list argument from alloc_pages_bulk_noprof() along with the alloc_pages_bulk_list() function. Now that only the *_array() flavour of the API remains, we can do the following renaming (along with the _noprof() ones): alloc_pages_bulk_array -> alloc_pages_bulk alloc_pages_bulk_array_mempolicy -> alloc_pages_bulk_mempolicy alloc_pages_bulk_array_node -> alloc_pages_bulk_node Link: https://lkml.kernel.org/r/275a3bbc0be20fbe9002297d60045e67ab3d4ada.1734991165.git.luizcap@redhat.com Signed-off-by: Luiz Capitulino <luizcap@redhat.com> Acked-by: David Hildenbrand <david@redhat.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Yunsheng Lin <linyunsheng@huawei.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2025-01-25 20:22:31 -08:00
Christoph Hellwig	f5f0ed89f1	xfs: don't call remap_verify_area with sb write protection held The XFS_IOC_EXCHANGE_RANGE ioctl with the XFS_EXCHANGE_RANGE_TO_EOF flag operates on a range bounded by the end of the file. This means the actual amount of blocks exchanged is derived from the inode size, which is only stable with the IOLOCK (i_rwsem) held. Do that, it currently calls remap_verify_area from inside the sb write protection which nests outside the IOLOCK. But this makes fsnotify_file_area_perm which is called from remap_verify_area unhappy when the kernel is built with lockdep and the recently added CONFIG_FANOTIFY_ACCESS_PERMISSIONS option. Fix this by always calling remap_verify_area before taking the write protection, and passing a 0 size to remap_verify_area similar to the FICLONE/FICLONERANGE ioctls when they are asked to clone until the file end. (Note: the size argument gets passed to fsnotify_file_area_perm, but then isn't actually used there). Fixes: `9a64d9b310` ("xfs: introduce new file range exchange ioctl") Cc: <stable@vger.kernel.org> # v6.10 Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Carlos Maiolino <cem@kernel.org>	2025-01-24 12:08:50 +01:00
Christoph Hellwig	89841b2380	xfs: remove an out of data comment in _xfs_buf_alloc There hasn't been anything like an io_length for a long time. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Carlos Maiolino <cem@kernel.org>	2025-01-24 12:06:18 +01:00
Jinliang Zheng	915175b49f	xfs: fix the entry condition of exact EOF block allocation optimization When we call create(), lseek() and write() sequentially, offset != 0 cannot be used as a judgment condition for whether the file already has extents. Furthermore, when xfs_bmap_adjacent() has not given a better blkno, it is not necessary to use exact EOF block allocation. Suggested-by: Dave Chinner <david@fromorbit.com> Signed-off-by: Jinliang Zheng <alexjlzheng@tencent.com> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Carlos Maiolino <cem@kernel.org>	2025-01-24 12:05:12 +01:00
Linus Torvalds	8883957b3c	\n -----BEGIN PGP SIGNATURE----- iQEzBAABCAAdFiEEq1nRK9aeMoq1VSgcnJ2qBz9kQNkFAmePs7oACgkQnJ2qBz9k QNmHuAf9GkLnY5u1/81xP5V9ukZ4N2yeMW0dydLS5cjWj/St5ELeMAza3jeqtJtD j36vbnmy2c5pPaGLAK8BJpMXT/R2TkmmKD004zcfqF2S3SgbGzdgO1zMZzq9KJpM woRKZtLuglDajedsDEBBcKotBhlN2+C/sQlFuL1mX4zitk9ajr0qYUB1+JqOeg5f qwPsDLT077ADpxd7lVIMcm+OqbduP5KWkBKYHpn7lJcLe1eqVMMzceJroW42zhVG Dq8Iln26bbU9Wx6FSPFCUcHEzHRHUfXmu07HN9U0X++0QgWjrmBQQLooGFB/bR4a edBrPpVas6xE4/brjgFX3gOKtv8xYg== =ewDV -----END PGP SIGNATURE----- Merge tag 'fsnotify_hsm_for_v6.14-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs Pull fsnotify pre-content notification support from Jan Kara: "This introduces a new fsnotify event (FS_PRE_ACCESS) that gets generated before a file contents is accessed. The event is synchronous so if there is listener for this event, the kernel waits for reply. On success the execution continues as usual, on failure we propagate the error to userspace. This allows userspace to fill in file content on demand from slow storage. The context in which the events are generated has been picked so that we don't hold any locks and thus there's no risk of a deadlock for the userspace handler. The new pre-content event is available only for users with global CAP_SYS_ADMIN capability (similarly to other parts of fanotify functionality) and it is an administrator responsibility to make sure the userspace event handler doesn't do stupid stuff that can DoS the system. Based on your feedback from the last submission, fsnotify code has been improved and now file->f_mode encodes whether pre-content event needs to be generated for the file so the fast path when nobody wants pre-content event for the file just grows the additional file->f_mode check. As a bonus this also removes the checks whether the old FS_ACCESS event needs to be generated from the fast path. Also the place where the event is generated during page fault has been moved so now filemap_fault() generates the event if and only if there is no uptodate folio in the page cache. Also we have dropped FS_PRE_MODIFY event as current real-world users of the pre-content functionality don't really use it so let's start with the minimal useful feature set" * tag 'fsnotify_hsm_for_v6.14-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs: (21 commits) fanotify: Fix crash in fanotify_init(2) fs: don't block write during exec on pre-content watched files fs: enable pre-content events on supported file systems ext4: add pre-content fsnotify hook for DAX faults btrfs: disable defrag on pre-content watched files xfs: add pre-content fsnotify hook for DAX faults fsnotify: generate pre-content permission event on page fault mm: don't allow huge faults for files with pre content watches fanotify: disable readahead if we have pre-content watches fanotify: allow to set errno in FAN_DENY permission response fanotify: report file range info with pre-content events fanotify: introduce FAN_PRE_ACCESS permission event fsnotify: generate pre-content permission event on truncate fsnotify: pass optional file access range in pre-content event fsnotify: introduce pre-content permission events fanotify: reserve event bit of deprecated FAN_DIR_MODIFY fanotify: rename a misnamed constant fanotify: don't skip extra event info if no info_mode is set fsnotify: check if file is actually being watched for pre-content events on open fsnotify: opt-in for permission events at file open time ...	2025-01-23 13:36:06 -08:00
Linus Torvalds	b477ff98d9	New XFS code for 6.14 * Implement reflink support for the realtime device * Implement reverse-mapping support for the realtime device * Several bug fixes and cleanups Signed-off-by: Carlos Maiolino <cem@kernel.org> -----BEGIN PGP SIGNATURE----- iJUEABMJAB0WIQSmtYVZ/MfVMGUq1GNcsMJ8RxYuYwUCZ490YgAKCRBcsMJ8RxYu Y0ERAX9ClltZD861L93lA30CnQwqlWNTdZDu/YB0LfPw2hM6jVIp6WLJIW7kz08q jvJMD8UBf0jRKI0I5OlD2Vw60kB1GmliUPo4QlkAzy9YNLtAg8OO2mGbEIJA21oS y3rAdzPBFg== =K947 -----END PGP SIGNATURE----- Merge tag 'xfs-merge-6.14' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux Pull XFS updates from Carlos Maiolino: "This is mostly focused on the implementation of reflink and reverse-mapping support for XFS's real-time devices. It also includes several bugfixes. - Implement reflink support for the realtime device - Implement reverse-mapping support for the realtime device - Several bug fixes and cleanups" * tag 'xfs-merge-6.14' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux: (121 commits) xfs: fix buffer lookup vs release race xfs: check for dead buffers in xfs_buf_find_insert xfs: add a b_iodone callback to struct xfs_buf xfs: move b_li_list based retry handling to common code xfs: simplify xfsaild_resubmit_item xfs: always complete the buffer inline in xfs_buf_submit xfs: remove the extra buffer reference in xfs_buf_submit xfs: move invalidate_kernel_vmap_range to xfs_buf_ioend xfs: simplify buffer I/O submission xfs: move in-memory buftarg handling out of _xfs_buf_ioapply xfs: move write verification out of _xfs_buf_ioapply xfs: remove xfs_buf_delwri_submit_buffers xfs: simplify xfs_buf_delwri_pushbuf xfs: move xfs_buf_iowait out of (__)xfs_buf_submit xfs: remove the incorrect comment about the b_pag field xfs: remove the incorrect comment above xfs_buf_free_maps xfs: fix a double completion for buffers on in-memory targets xfs/libxfs: replace kmalloc() and memcpy() with kmemdup() xfs: constify feature checks xfs: refactor xfs_fs_statfs ...	2025-01-23 13:06:42 -08:00
Linus Torvalds	47c9f2b3c8	vfs-6.14-rc1.statx.dio -----BEGIN PGP SIGNATURE----- iHUEABYKAB0WIQRAhzRXHqcMeLMyaSiRxhvAZXjcogUCZ4pSTwAKCRCRxhvAZXjc oiSbAQCIWp8Jm2FX9Mv+eX8erLFlyQSDQauAnqPtW/SvMbpgFgEAuZOTSRU3GSqI NiowpYms9OckO638GlNHlSTUTcV4YwU= =obHT -----END PGP SIGNATURE----- Merge tag 'vfs-6.14-rc1.statx.dio' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs Pull vfs direct-io updates from Christian Brauner: "File systems that write out of place usually require different alignment for direct I/O writes than what they can do for reads. Add a separate dio read align field to statx, as many out of place write file systems can easily do reads aligned to the device sector size, but require bigger alignment for writes. This is usually papered over by falling back to buffered I/O for smaller writes and doing read-modify-write cycles, but performance for this sucks, so applications benefit from knowing the actual write alignment" * tag 'vfs-6.14-rc1.statx.dio' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: xfs: report larger dio alignment for COW inodes xfs: report the correct read/write dio alignment for reflinked inodes xfs: cleanup xfs_vn_getattr fs: add STATX_DIO_READ_ALIGN fs: reformat the statx definition	2025-01-20 11:16:50 -08:00
Christoph Hellwig	ee10f6fcdb	xfs: fix buffer lookup vs release race Since commit `298f342245` ("xfs: lockless buffer lookup") the buffer lookup fastpath is done without a hash-wide lock (then pag_buf_lock, now bc_lock) and only under RCU protection. But this means that nothing serializes lookups against the temporary 0 reference count for buffers that are added to the LRU after dropping the last regular reference, and a concurrent lookup would fail to find them. Fix this by doing all b_hold modifications under b_lock. We're already doing this for release so this "only" ~ doubles the b_lock round trips. We'll later look into the lockref infrastructure to optimize the number of lock round trips again. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Carlos Maiolino <cem@kernel.org>	2025-01-16 10:19:59 +01:00
Christoph Hellwig	07eae0fa67	xfs: check for dead buffers in xfs_buf_find_insert Commit `32dd4f9c50` ("xfs: remove a superflous hash lookup when inserting new buffers") converted xfs_buf_find_insert to use rhashtable_lookup_get_insert_fast and thus an operation that returns the existing buffer when an insert would duplicate the hash key. But this code path misses the check for a buffer with a reference count of zero, which could lead to reusing an about to be freed buffer. Fix this by using the same atomic_inc_not_zero pattern as xfs_buf_insert. Fixes: `32dd4f9c50` ("xfs: remove a superflous hash lookup when inserting new buffers") Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Cc: stable@vger.kernel.org # v6.0 Signed-off-by: Carlos Maiolino <cem@kernel.org>	2025-01-16 10:19:59 +01:00
Christoph Hellwig	4e35be63c4	xfs: add a b_iodone callback to struct xfs_buf Stop open coding the log item completions and instead add a callback into back into the submitter. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Acked-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Carlos Maiolino <cem@kernel.org>	2025-01-14 11:38:15 +01:00
Christoph Hellwig	4f1aefd13e	xfs: move b_li_list based retry handling to common code The dquot and inode version are very similar, which is expected given the overall b_li_list logic. The differences are that the inode version also clears the XFS_LI_FLUSHING which is defined in common but only ever set by the inode item, and that the dquot version takes the ail_lock over the list iteration. While this seems sensible given that additions and removals from b_li_list are protected by the ail_lock, log items are only added before buffer submission, and are only removed when completing the buffer, so nothing can change the list when retrying a buffer. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Acked-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Carlos Maiolino <cem@kernel.org>	2025-01-14 11:38:15 +01:00
Christoph Hellwig	46eba93d4f	xfs: simplify xfsaild_resubmit_item Since commit `acc8f8628c` ("xfs: attach dquot buffer to dquot log item buffer") all buf items that use bp->b_li_list are explicitly checked for in the branch to just clears XFS_LI_FAILED. Remove the dead arm that calls xfs_clear_li_failed. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Acked-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Carlos Maiolino <cem@kernel.org>	2025-01-14 11:38:15 +01:00
Christoph Hellwig	819f29cc7b	xfs: always complete the buffer inline in xfs_buf_submit xfs_buf_submit now only completes a buffer on error, or for in-memory buftargs. There is no point in using a workqueue for the latter as the completion will just wake up the caller. Optimize this case by avoiding the workqueue roundtrip. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Acked-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Carlos Maiolino <cem@kernel.org>	2025-01-14 11:38:15 +01:00
Christoph Hellwig	6dca5abb3d	xfs: remove the extra buffer reference in xfs_buf_submit Nothing touches the buffer after it has been submitted now, so the need for the extra transient reference went away as well. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Acked-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Carlos Maiolino <cem@kernel.org>	2025-01-14 11:38:15 +01:00
Christoph Hellwig	5c82a471c2	xfs: move invalidate_kernel_vmap_range to xfs_buf_ioend Invalidating cache lines can be fairly expensive, so don't do it in interrupt context. Note that in practice very few setup will actually do anything here as virtually indexed caches are rather uncommon, but we might as well move the call to the proper place while touching this area. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Acked-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Carlos Maiolino <cem@kernel.org>	2025-01-14 11:38:15 +01:00
Christoph Hellwig	fac69ec8cd	xfs: simplify buffer I/O submission The code in _xfs_buf_ioapply is unnecessarily complicated because it doesn't take advantage of modern bio features. Simplify it by making use of bio splitting and chaining, that is build a single bio for the pages in the buffer using a simple loop, and then split that bio on the map boundaries for discontiguous multi-FSB buffers and chain the split bios to the main one so that there is only a single I/O completion. This not only simplifies the code to build the buffer, but also removes the need for the b_io_remaining field as buffer ownership is granted to the bio on submit of the final bio with no chance for a completion before that as well as the b_io_error field that is now superfluous because there always is exactly one completion. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Acked-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Carlos Maiolino <cem@kernel.org>	2025-01-14 11:38:15 +01:00
Christoph Hellwig	8db65d312b	xfs: move in-memory buftarg handling out of _xfs_buf_ioapply No I/O to apply for in-memory buffers, so skip the function call entirely. Clean up the b_io_error initialization logic to allow for this. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Acked-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Carlos Maiolino <cem@kernel.org>	2025-01-14 11:38:14 +01:00
Christoph Hellwig	0195647aba	xfs: move write verification out of _xfs_buf_ioapply Split the write verification logic out of _xfs_buf_ioapply into a new xfs_buf_verify_write helper called by xfs_buf_submit given that it isn't about applying the I/O and doesn't really fit in with the rest of _xfs_buf_ioapply. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Acked-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Carlos Maiolino <cem@kernel.org>	2025-01-14 11:38:14 +01:00
Christoph Hellwig	72842dbc2b	xfs: remove xfs_buf_delwri_submit_buffers xfs_buf_delwri_submit_buffers has two callers for synchronous and asynchronous writes that share very little logic. Split out a helper for the shared per-buffer loop and otherwise open code the submission in the two callers. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Acked-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Carlos Maiolino <cem@kernel.org>	2025-01-14 11:38:14 +01:00
Christoph Hellwig	eb43b0b5ca	xfs: simplify xfs_buf_delwri_pushbuf xfs_buf_delwri_pushbuf synchronously writes a buffer that is on a delwri list already. Instead of doing a complicated dance with the delwri and wait list, just leave them alone and open code the actual buffer write. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Acked-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Carlos Maiolino <cem@kernel.org>	2025-01-14 11:38:14 +01:00
Christoph Hellwig	05b5968f33	xfs: move xfs_buf_iowait out of (__)xfs_buf_submit There is no good reason to pass a bool argument to wait for a buffer when the callers that want that can easily just wait themselves. This means the wait moves out of the extra hold of the buffer, but as the callers of synchronous buffer I/O need to hold a reference anyway that is perfectly fine. Because all async buffer submitters ignore the error return value, and the synchronous ones catch the error condition through b_error and xfs_buf_iowait this also means the new xfs_buf_submit doesn't have to return an error code. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Acked-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Carlos Maiolino <cem@kernel.org>	2025-01-14 11:38:14 +01:00
Christoph Hellwig	411ff3f738	xfs: remove the incorrect comment about the b_pag field The rbtree root is long gone. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Acked-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Carlos Maiolino <cem@kernel.org>	2025-01-14 11:38:14 +01:00
Christoph Hellwig	83e9c69dcf	xfs: remove the incorrect comment above xfs_buf_free_maps The comment above xfs_buf_free_maps talks about fields not even used in the function and also doesn't add any other value. Remove it. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Acked-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Carlos Maiolino <cem@kernel.org>	2025-01-14 11:38:14 +01:00
Christoph Hellwig	cbd6883ed8	xfs: fix a double completion for buffers on in-memory targets __xfs_buf_submit calls xfs_buf_ioend when b_io_remaining hits zero. For in-memory buftargs b_io_remaining is never incremented from it's initial value of 1, so this always happens. Thus the extra call to xfs_buf_ioend in _xfs_buf_ioapply causes a double completion. Fortunately __xfs_buf_submit is only used for synchronous reads on in-memory buftargs due to the peculiarities of how they work, so this is mostly harmless and just causes a little extra work to be done. Fixes: `5076a6040c` ("xfs: support in-memory buffer cache targets") Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Acked-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Carlos Maiolino <cem@kernel.org>	2025-01-14 11:38:14 +01:00
Mirsad Todorovac	9d9b724726	xfs/libxfs: replace kmalloc() and memcpy() with kmemdup() The source static analysis tool gave the following advice: ./fs/xfs/libxfs/xfs_dir2.c:382:15-22: WARNING opportunity for kmemdup → 382 args->value = kmalloc(len, 383 GFP_KERNEL \| __GFP_NOLOCKDEP \| __GFP_RETRY_MAYFAIL); 384 if (!args->value) 385 return -ENOMEM; 386 → 387 memcpy(args->value, name, len); 388 args->valuelen = len; 389 return -EEXIST; Replacing kmalloc() + memcpy() with kmemdump() doesn't change semantics. Original code works without fault, so this is not a bug fix but proposed improvement. Link: https://lwn.net/Articles/198928/ Fixes: `94a69db236` ("xfs: use __GFP_NOLOCKDEP instead of GFP_NOFS") Fixes: `384f3ced07` ("[XFS] Return case-insensitive match for dentry cache") Fixes: `2451337dd0` ("xfs: global error sign conversion") Cc: Carlos Maiolino <cem@kernel.org> Cc: Darrick J. Wong <djwong@kernel.org> Cc: Chandan Babu R <chandanbabu@kernel.org> Cc: Dave Chinner <dchinner@redhat.com> Cc: linux-xfs@vger.kernel.org Cc: linux-kernel@vger.kernel.org Reviewed-by: "Darrick J. Wong" <djwong@kernel.org> Signed-off-by: Mirsad Todorovac <mtodorovac69@gmail.com> Signed-off-by: Carlos Maiolino <cem@kernel.org>	2025-01-13 14:58:04 +01:00
Christoph Hellwig	183d988ae9	xfs: constify feature checks They will eventually be needed to be const for zoned growfs, but even now having such simpler helpers as const as possible is a good thing. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com> Signed-off-by: Carlos Maiolino <cem@kernel.org>	2025-01-13 14:57:08 +01:00
Christoph Hellwig	dd324cb79e	xfs: refactor xfs_fs_statfs Split out helpers for data, rt data and inode related information, and assigning f_bavail once instead of in three places. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Carlos Maiolino <cem@kernel.org>	2025-01-13 14:56:30 +01:00
Christoph Hellwig	72843ca624	xfs: don't take m_sb_lock in xfs_fs_statfs The only non-constant value read under m_sb_lock in xfs_fs_statfs is sb_dblocks, and it could become stale right after dropping the lock anyway. Remove the thus pointless lock section. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Carlos Maiolino <cem@kernel.org>	2025-01-13 14:56:24 +01:00
Christoph Hellwig	f4752daf47	xfs: fix the comment above xfs_discard_endio pagb_lock has been replaced with eb_lock. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Carlos Maiolino <cem@kernel.org>	2025-01-13 14:56:20 +01:00
Long Li	09f7680dea	xfs: remove bp->b_error check in xfs_attr3_root_inactive The b_error check right after xfs_trans_get_buf() is redundant: 1) If the buffer is found in transaction via xfs_trans_buf_item_match(), any corrupted metadata error would have already been exposed during previous reads like xfs_da3_node_read(). 2) If the buffer is obtained via xfs_buf_get_map(): - It's called without XBF_READ flag, so won't return buffer with b_error set, since xfs_buf_get_map() will clear it anyway. - Buffer found in cache normally won't have error since previous reads had checked it, unless someone corrupts the buffer and the AIL pushes it out to disk while the buffer's unlocked. But in this case, AIL will shut down the log. Remove this redundant check to simplify the code, make the code consistent with most other xfs_trans_get_buf() callers in XFS. Signed-off-by: Long Li <leo.lilong@huawei.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Carlos Maiolino <cem@kernel.org>	2025-01-13 14:56:15 +01:00
Long Li	adcaff355b	xfs: remove redundant update for ticket->t_curr_res in xfs_log_ticket_regrant The current reservation of the log ticket has already been updated a few lines above in xfs_log_ticket_regrant(), so there is no need to update it again. This is just a code cleanup with no functional changes. Signed-off-by: Long Li <leo.lilong@huawei.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Carlos Maiolino <cem@kernel.org>	2025-01-13 14:56:09 +01:00
Long Li	99fc33d16b	xfs: clean up xfs_end_ioend() to reuse local variables Use already initialized local variables 'offset' and 'size' instead of accessing ioend members directly in xfs_setfilesize() call. This is just a code cleanup with no functional changes. Signed-off-by: Long Li <leo.lilong@huawei.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Carlos Maiolino <cem@kernel.org>	2025-01-13 14:56:02 +01:00
Long Li	efebe42d95	xfs: fix mount hang during primary superblock recovery failure When mounting an image containing a log with sb modifications that require log replay, the mount process hang all the time and stack as follows: [root@localhost ~]# cat /proc/557/stack [<0>] xfs_buftarg_wait+0x31/0x70 [<0>] xfs_buftarg_drain+0x54/0x350 [<0>] xfs_mountfs+0x66e/0xe80 [<0>] xfs_fs_fill_super+0x7f1/0xec0 [<0>] get_tree_bdev_flags+0x186/0x280 [<0>] get_tree_bdev+0x18/0x30 [<0>] xfs_fs_get_tree+0x1d/0x30 [<0>] vfs_get_tree+0x2d/0x110 [<0>] path_mount+0xb59/0xfc0 [<0>] do_mount+0x92/0xc0 [<0>] __x64_sys_mount+0xc2/0x160 [<0>] x64_sys_call+0x2de4/0x45c0 [<0>] do_syscall_64+0xa7/0x240 [<0>] entry_SYSCALL_64_after_hwframe+0x76/0x7e During log recovery, while updating the in-memory superblock from the primary SB buffer, if an error is encountered, such as superblock corruption occurs or some other reasons, we will proceed to out_release and release the xfs_buf. However, this is insufficient because the xfs_buf's log item has already been initialized and the xfs_buf is held by the buffer log item as follows, the xfs_buf will not be released, causing the mount thread to hang. xlog_recover_do_primary_sb_buffer xlog_recover_do_reg_buffer xlog_recover_validate_buf_type xfs_buf_item_init(bp, mp) The solution is straightforward, we simply need to allow it to be handled by the normal buffer write process. The filesystem will be shutdown before the submission of buffer_list in xlog_do_recovery_pass(), ensuring the correct release of the xfs_buf as follows: xlog_do_recovery_pass error = xlog_recover_process xlog_recover_process_data xlog_recover_process_ophdr xlog_recovery_process_trans ... xlog_recover_buf_commit_pass2 error = xlog_recover_do_primary_sb_buffer //Encounter error and return if (error) goto out_writebuf ... out_writebuf: xfs_buf_delwri_queue(bp, buffer_list) //add bp to list return error ... if (!list_empty(&buffer_list)) if (error) xlog_force_shutdown(log, SHUTDOWN_LOG_IO_ERROR); //shutdown first xfs_buf_delwri_submit(&buffer_list); //submit buffers in list __xfs_buf_submit if (bp->b_mount->m_log && xlog_is_shutdown(bp->b_mount->m_log)) xfs_buf_ioend_fail(bp) //release bp correctly Fixes: `6a18765b54` ("xfs: update the file system geometry after recoverying superblock buffers") Cc: stable@vger.kernel.org # v6.12 Signed-off-by: Long Li <leo.lilong@huawei.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Carlos Maiolino <cem@kernel.org>	2025-01-13 14:55:37 +01:00
Christoph Hellwig	471511d6ef	xfs: remove the t_magic field in struct xfs_trans The t_magic field is only ever assigned to, but never read. Remove it. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Carlos Maiolino <cem@kernel.org>	2025-01-13 14:55:19 +01:00
Christoph Hellwig	415dee1e06	xfs: remove XFS_ILOG_NONCORE XFS_ILOG_NONCORE is not used in the kernel code or xfsprogs, remove it. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Carlos Maiolino <cem@kernel.org>	2025-01-13 14:55:14 +01:00
Christoph Hellwig	23ebf63925	xfs: mark xfs_dir_isempty static And return bool instead of a boolean condition as int. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Carlos Maiolino <cem@kernel.org>	2025-01-13 14:55:06 +01:00
Carlos Maiolino	156d1c389c	xfs: reflink on the realtime device [v6.2 05/14] This patchset enables use of the file data block sharing feature (i.e. reflink) on the realtime device. It follows the same basic sequence as the realtime rmap series -- first a few cleanups; then introduction of the new btree format and inode fork format. Next comes enabling CoW and remapping for the rt device; new scrub, repair, and health reporting code; and at the end we implement some code to lengthen write requests so that rt extents are always CoWed fully. This has been running on the djcloud for months with no problems. Enjoy! Signed-off-by: "Darrick J. Wong" <djwong@kernel.org> -----BEGIN PGP SIGNATURE----- iHUEABYKAB0WIQQ2qTKExjcn+O1o2YRKO3ySh0YRpgUCZ2nQ6gAKCRBKO3ySh0YR phxFAQDsng0DbICm7bQPkjOxCG8SGMv0DRIABYe7kmk4NJuZkQEA/mHiDSA6CLg+ bmqS9dfMRMSG/hl+9+Vp3e3CsZgMxQ4= =dWPC -----END PGP SIGNATURE----- Merge tag 'realtime-reflink_2024-12-23' of https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux into for-next xfs: reflink on the realtime device [v6.2 05/14] This patchset enables use of the file data block sharing feature (i.e. reflink) on the realtime device. It follows the same basic sequence as the realtime rmap series -- first a few cleanups; then introduction of the new btree format and inode fork format. Next comes enabling CoW and remapping for the rt device; new scrub, repair, and health reporting code; and at the end we implement some code to lengthen write requests so that rt extents are always CoWed fully. This has been running on the djcloud for months with no problems. Enjoy! Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>	2025-01-13 14:54:52 +01:00
Carlos Maiolino	a938bbe473	xfs: realtime reverse-mapping support [v6.2 04/14] This is the latest revision of a patchset that adds to XFS kernel support for reverse mapping for the realtime device. This time around I've fixed some of the bitrot that I've noticed over the past few months, and most notably have converted rtrmapbt to use the metadata inode directory feature instead of burning more space in the superblock. At the beginning of the set are patches to implement storing B+tree leaves in an inode root, since the realtime rmapbt is rooted in an inode, unlike the regular rmapbt which is rooted in an AG block. Prior to this, the only btree that could be rooted in the inode fork was the block mapping btree; if all the extent records fit in the inode, format would be switched from 'btree' to 'extents'. The next few patches enhance the reverse mapping routines to handle the parts that are specific to rtgroups -- adding the new btree type, adding a new log intent item type, and wiring up the metadata directory tree entries. Finally, implement GETFSMAP with the rtrmapbt and scrub functionality for the rtrmapbt and rtbitmap and online fsck functionality. This has been running on the djcloud for months with no problems. Enjoy! Signed-off-by: "Darrick J. Wong" <djwong@kernel.org> -----BEGIN PGP SIGNATURE----- iHUEABYKAB0WIQQ2qTKExjcn+O1o2YRKO3ySh0YRpgUCZ2nQ6gAKCRBKO3ySh0YR puNFAP4j44Z3j+WEMSONKXLgl3Goo7Xahm/teGTzSN42l7lUtQEA5zhKcvDgiNIe KXxtJcRmHCqq8qaxqcnM6vlcHok92wM= =XjiL -----END PGP SIGNATURE----- Merge tag 'realtime-rmap_2024-12-23' of https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux into for-next xfs: realtime reverse-mapping support [v6.2 04/14] This is the latest revision of a patchset that adds to XFS kernel support for reverse mapping for the realtime device. This time around I've fixed some of the bitrot that I've noticed over the past few months, and most notably have converted rtrmapbt to use the metadata inode directory feature instead of burning more space in the superblock. At the beginning of the set are patches to implement storing B+tree leaves in an inode root, since the realtime rmapbt is rooted in an inode, unlike the regular rmapbt which is rooted in an AG block. Prior to this, the only btree that could be rooted in the inode fork was the block mapping btree; if all the extent records fit in the inode, format would be switched from 'btree' to 'extents'. The next few patches enhance the reverse mapping routines to handle the parts that are specific to rtgroups -- adding the new btree type, adding a new log intent item type, and wiring up the metadata directory tree entries. Finally, implement GETFSMAP with the rtrmapbt and scrub functionality for the rtrmapbt and rtbitmap and online fsck functionality. This has been running on the djcloud for months with no problems. Enjoy! Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>	2025-01-13 14:54:41 +01:00
Carlos Maiolino	8a092f440e	xfs: enable in-core block reservation for rt metadata [v6.2 03/14] In preparation for adding reverse mapping and refcounting to the realtime device, enhance the metadir code to reserve free space for btree shape changes as delayed allocation blocks. This enables us to pre-allocate space for the rmap and refcount btrees in the same manner as we do for the data device counterparts, which is how we avoid ENOSPC failures when space is low but we've already committed to a COW operation. This has been running on the djcloud for months with no problems. Enjoy! Signed-off-by: "Darrick J. Wong" <djwong@kernel.org> -----BEGIN PGP SIGNATURE----- iHUEABYKAB0WIQQ2qTKExjcn+O1o2YRKO3ySh0YRpgUCZ2nQ6gAKCRBKO3ySh0YR pluAAP9u4A74HNTi7Df4C70diZIpXynSnyzAxVirQzda/WlUzAD+OjZEo7VkLgG4 AJIqspcX632tFBFhBAhf3XzqWlM+zws= =hOw3 -----END PGP SIGNATURE----- Merge tag 'reserve-rt-metadata-space_2024-12-23' of https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux into for-next xfs: enable in-core block reservation for rt metadata [v6.2 03/14] In preparation for adding reverse mapping and refcounting to the realtime device, enhance the metadir code to reserve free space for btree shape changes as delayed allocation blocks. This enables us to pre-allocate space for the rmap and refcount btrees in the same manner as we do for the data device counterparts, which is how we avoid ENOSPC failures when space is low but we've already committed to a COW operation. This has been running on the djcloud for months with no problems. Enjoy! Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>	2025-01-13 14:54:33 +01:00
Carlos Maiolino	9a2ce7254c	xfs: refactor btrees to support records in inode root [v6.2 02/14] Amend the btree code to support storing btree rcords in the inode root, because the current bmbt code does not support this. This has been running on the djcloud for months with no problems. Enjoy! Signed-off-by: "Darrick J. Wong" <djwong@kernel.org> -----BEGIN PGP SIGNATURE----- iHUEABYKAB0WIQQ2qTKExjcn+O1o2YRKO3ySh0YRpgUCZ2nQ6QAKCRBKO3ySh0YR piSBAP0dmPfWwKtdIx1AQUmfQ4FeNQbSk5PTKLuXIjyfUug9QwEAu7L616xtrgt6 6/9CBlIUNp3gJriZzLbCjS72F7hssgQ= =Jk+2 -----END PGP SIGNATURE----- Merge tag 'btree-ifork-records_2024-12-23' of https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux into for-next xfs: refactor btrees to support records in inode root [v6.2 02/14] Amend the btree code to support storing btree rcords in the inode root, because the current bmbt code does not support this. This has been running on the djcloud for months with no problems. Enjoy! Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>	2025-01-13 14:54:23 +01:00
Carlos Maiolino	69bf6cd7f3	xfs: bug fixes for 6.13 [01/14] Bug fixes for 6.13. This has been running on the djcloud for months with no problems. Enjoy! Signed-off-by: "Darrick J. Wong" <djwong@kernel.org> -----BEGIN PGP SIGNATURE----- iHUEABYKAB0WIQQ2qTKExjcn+O1o2YRKO3ySh0YRpgUCZ2nQ6QAKCRBKO3ySh0YR pjVrAQD9ty5VSpobCR0HRKImR1zCvjreaTzDty+XXnCrYVka1AEA1NbC53RxAPoy bW5GJ0ZEvQhamkx9YuBfiJXoqYw1jgI= =ZmFb -----END PGP SIGNATURE----- Merge tag 'xfs-6.13-fixes_2024-12-23' of https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux into for-next xfs: bug fixes for 6.13 [01/14] Bug fixes for 6.13. This has been running on the djcloud for months with no problems. Enjoy! Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>	2025-01-13 14:54:14 +01:00
Darrick J. Wong	111d36d627	xfs: lock dquot buffer before detaching dquot from b_li_list We have to lock the buffer before we can delete the dquot log item from the buffer's log item list. Cc: stable@vger.kernel.org # v6.13-rc3 Fixes: `acc8f8628c` ("xfs: attach dquot buffer to dquot log item buffer") Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Carlos Maiolino <cem@kernel.org>	2025-01-10 10:12:48 +01:00
Christoph Hellwig	468210ec76	xfs: report larger dio alignment for COW inodes For I/O to reflinked blocks we always need to write an entire new file system block, and the code enforces the file system block alignment for the entire file if it has any reflinked blocks. Mirror the larger value reported in the statx in the dio_offset_align in the xfs-specific XFS_IOC_DIOINFO ioctl for the same reason. Don't bother adding a new field for the read alignment to this legacy ioctl as all new users should use statx instead. Signed-off-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20250109083109.1441561-6-hch@lst.de Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Christian Brauner <brauner@kernel.org>	2025-01-09 16:23:18 +01:00
Christoph Hellwig	7422bbd030	xfs: report the correct read/write dio alignment for reflinked inodes For I/O to reflinked blocks we always need to write an entire new file system block, and the code enforces the file system block alignment for the entire file if it has any reflinked blocks. Use the new STATX_DIO_READ_ALIGN flag to report the asymmetric read vs write alignments for reflinked files. Signed-off-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20250109083109.1441561-5-hch@lst.de Reviewed-by: John Garry <john.g.garry@oracle.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Christian Brauner <brauner@kernel.org>	2025-01-09 16:23:18 +01:00
Christoph Hellwig	7e17483c7b	xfs: cleanup xfs_vn_getattr Split the two bits of optional statx reporting into their own helpers so that they are self-contained instead of deeply indented in the main getattr handler. Signed-off-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20250109083109.1441561-4-hch@lst.de Reviewed-by: John Garry <john.g.garry@oracle.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Christian Brauner <brauner@kernel.org>	2025-01-09 16:23:17 +01:00
Christoph Hellwig	7ee7c9b39e	xfs: don't return an error from xfs_update_last_rtgroup_size for !XFS_RT Non-rtg file systems have a fake RT group even if they do not have a RT device, and thus an rgcount of 1. Ensure xfs_update_last_rtgroup_size doesn't fail when called for !XFS_RT to handle this case. Fixes: `87fe4c34a3` ("xfs: create incore realtime group structures") Reported-by: Brian Foster <bfoster@redhat.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Carlos Maiolino <cem@kernel.org>	2025-01-08 15:04:40 +01:00
Darrick J. Wong	155debbe7e	xfs: enable realtime reflink Enable reflink for realtime devices, as long as the realtime allocation unit is a single fsblock. Signed-off-by: "Darrick J. Wong" <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>	2024-12-23 13:06:17 -08:00
Darrick J. Wong	fd97fe1112	xfs: fix CoW forks for realtime files Port the copy on write fork repair to realtime files. Signed-off-by: "Darrick J. Wong" <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>	2024-12-23 13:06:17 -08:00
Darrick J. Wong	12f4d20328	xfs: check for shared rt extents when rebuilding rt file's data fork When we're rebuilding the data fork of a realtime file, we need to cross-reference each mapping with the rt refcount btree to ensure that the reflink flag is set if there are any shared extents found. Signed-off-by: "Darrick J. Wong" <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>	2024-12-23 13:06:16 -08:00
Darrick J. Wong	92b2019493	xfs: repair inodes that have a refcount btree in the data fork Plumb knowledge of refcount btrees into the inode core repair code. Signed-off-by: "Darrick J. Wong" <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>	2024-12-23 13:06:16 -08:00
Darrick J. Wong	83ccffc489	xfs: online repair of the realtime refcount btree Port the data device's refcount btree repair code to the realtime refcount btree. Signed-off-by: "Darrick J. Wong" <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>	2024-12-23 13:06:16 -08:00
Darrick J. Wong	fe2efe9508	xfs: capture realtime CoW staging extents when rebuilding rt rmapbt Walk the realtime refcount btree to find the CoW staging extents when we're rebuilding the realtime rmap btree. Signed-off-by: "Darrick J. Wong" <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>	2024-12-23 13:06:16 -08:00
Darrick J. Wong	477493082f	xfs: walk the rt reference count tree when rebuilding rmap When we're rebuilding the data device rmap, if we encounter a "refcount" format fork, we have to walk the (realtime) refcount btree inode to build the appropriate mappings. Signed-off-by: "Darrick J. Wong" <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>	2024-12-23 13:06:16 -08:00
Darrick J. Wong	6470ceef32	xfs: check new rtbitmap records against rt refcount btree When we're rebuilding the realtime bitmap, check the proposed free extents against the rt refcount btree to make sure we don't commit any grievous errors. Signed-off-by: "Darrick J. Wong" <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>	2024-12-23 13:06:16 -08:00
Darrick J. Wong	cca34a3054	xfs: don't flag quota rt block usage on rtreflink filesystems Quota space usage is allowed to exceed the size of the physical storage when reflink is enabled. Now that we have reflink for the realtime volume, apply this same logic to the rtb repair logic. Signed-off-by: "Darrick J. Wong" <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>	2024-12-23 13:06:15 -08:00
Darrick J. Wong	ca757af07f	xfs: scrub the metadir path of rt refcount btree files Add a new XFS_SCRUB_METAPATH subtype so that we can scrub the metadata directory tree path to the refcount btree file for each rt group. Signed-off-by: "Darrick J. Wong" <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>	2024-12-23 13:06:15 -08:00
Darrick J. Wong	a9600db96f	xfs: detect and repair misaligned rtinherit directory cowextsize hints If we encounter a directory that has been configured to pass on a CoW extent size hint to a new realtime file and the hint isn't an integer multiple of the rt extent size, we should flag the hint for administrative review and/or turn it off because that is a misconfiguration. Signed-off-by: "Darrick J. Wong" <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>	2024-12-23 13:06:15 -08:00
Darrick J. Wong	48bc170f2c	xfs: allow dquot rt block count to exceed rt blocks on reflink fs Update the quota scrubber to allow dquots where the realtime block count exceeds the block count of the rt volume if reflink is enabled. Signed-off-by: "Darrick J. Wong" <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>	2024-12-23 13:06:15 -08:00
Darrick J. Wong	30f47950dc	xfs: check reference counts of gaps between rt refcount records If there's a gap between records in the rt refcount btree, we ought to cross-reference the gap with the rtrmap records to make sure that there aren't any overlapping records for a region that doesn't have any shared ownership. Signed-off-by: "Darrick J. Wong" <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>	2024-12-23 13:06:15 -08:00
Darrick J. Wong	2d9a3e9805	xfs: allow overlapping rtrmapbt records for shared data extents Allow overlapping realtime reverse mapping records if they both describe shared data extents and the fs supports reflink on the realtime volume. Signed-off-by: "Darrick J. Wong" <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>	2024-12-23 13:06:15 -08:00
Darrick J. Wong	91683bb3f2	xfs: cross-reference checks with the rt refcount btree Use the realtime refcount btree to implement cross-reference checks in other data structures. Signed-off-by: "Darrick J. Wong" <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>	2024-12-23 13:06:14 -08:00
Darrick J. Wong	c27929670d	xfs: scrub the realtime refcount btree Add code to scrub realtime refcount btrees. Similar to the refcount btree checking code for the data device, we walk the rmap btree for each refcount record to confirm that the reference counts are correct. Signed-off-by: "Darrick J. Wong" <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>	2024-12-23 13:06:14 -08:00
Darrick J. Wong	026c8ed8d4	xfs: report realtime refcount btree corruption errors to the health system Whenever we encounter corrupt realtime refcount btree blocks, we should report that to the health monitoring system for later reporting. Signed-off-by: "Darrick J. Wong" <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>	2024-12-23 13:06:14 -08:00
Darrick J. Wong	88a70768df	xfs: check that the rtrefcount maxlevels doesn't increase when growing fs The size of filesystem transaction reservations depends on the maximum height (maxlevels) of the realtime btrees. Since we don't want a grow operation to increase the reservation size enough that we'll fail the minimum log size checks on the next mount, constrain growfs operations if they would cause an increase in the rt refcount btree maxlevels. Signed-off-by: "Darrick J. Wong" <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>	2024-12-23 13:06:14 -08:00
Darrick J. Wong	8e84e8052b	xfs: enable extent size hints for CoW operations Wire up the copy-on-write extent size hint for realtime files, and connect it to the rt allocator so that we avoid fragmentation on rt filesystems. Signed-off-by: "Darrick J. Wong" <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>	2024-12-23 13:06:14 -08:00
Darrick J. Wong	4de1a7ba41	xfs: apply rt extent alignment constraints to CoW extsize hint The copy-on-write extent size hint is subject to the same alignment constraints as the regular extent size hint. Since we're in the process of adding reflink (and therefore CoW) to the realtime device, we must apply the same scattered rextsize alignment validation strategies to both hints to deal with the possibility of rextsize changing. Therefore, fix the inode validator to perform rextsize alignment checks on regular realtime files, and to remove misaligned directory hints. Signed-off-by: "Darrick J. Wong" <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>	2024-12-23 13:06:14 -08:00
Darrick J. Wong	6853d23bad	xfs: fix xfs_get_extsz_hint behavior with realtime alwayscow files Currently, we (ab)use xfs_get_extsz_hint so that it always returns a nonzero value for realtime files. This apparently was done to disable delayed allocation for realtime files. However, once we enable realtime reflink, we can also turn on the alwayscow flag to force CoW writes to realtime files. In this case, the logic will incorrectly send the write through the delalloc write path. Fix this by adjusting the logic slightly. Signed-off-by: "Darrick J. Wong" <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>	2024-12-23 13:06:13 -08:00
Darrick J. Wong	51e2326749	xfs: recover CoW leftovers in the realtime volume Scan the realtime refcount tree at mount time to get rid of leftover CoW staging extents. Signed-off-by: "Darrick J. Wong" <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>	2024-12-23 13:06:13 -08:00
Darrick J. Wong	c3d3605f96	xfs: allow inodes to have the realtime and reflink flags Now that we can share blocks between realtime files, allow this combination. Signed-off-by: "Darrick J. Wong" <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>	2024-12-23 13:06:13 -08:00
Darrick J. Wong	5519251da0	xfs: enable sharing of realtime file blocks Update the remapping routines to be able to handle realtime files. Signed-off-by: "Darrick J. Wong" <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>	2024-12-23 13:06:13 -08:00
Darrick J. Wong	26e97d9b4b	xfs: enable CoW for realtime data Update our write paths to support copy on write on the rt volume. This works in more or less the same way as it does on the data device, with the major exception that we never do delalloc on the rt volume. Because we consider unwritten CoW fork staging extents to be incore quota reservation, we update xfs_quota_reserve_blkres to support this case. Though xfs doesn't allow rt and quota together, the change is trivial and we shouldn't leave a logic bomb here. Signed-off-by: "Darrick J. Wong" <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>	2024-12-23 13:06:13 -08:00
Darrick J. Wong	3639c63d46	xfs: refactor reflink quota updates Hoist all quota updates for reflink into a helper function, since things are about to become more complicated. Signed-off-by: "Darrick J. Wong" <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>	2024-12-23 13:06:13 -08:00
Darrick J. Wong	c2694ff678	xfs: compute rtrmap btree max levels when reflink enabled Compute the maximum possible height of the realtime rmap btree when reflink is enabled. Signed-off-by: "Darrick J. Wong" <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>	2024-12-23 13:06:12 -08:00
Darrick J. Wong	0bada82331	xfs: update rmap to allow cow staging extents in the rt rmap Don't error out on CoW staging extent records when realtime reflink is enabled. Signed-off-by: "Darrick J. Wong" <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>	2024-12-23 13:06:12 -08:00
Darrick J. Wong	4ee3113aaf	xfs: create routine to allocate and initialize a realtime refcount btree inode Create a library routine to allocate and initialize an empty realtime refcountbt inode. We'll use this for growfs, mkfs, and repair. Signed-off-by: "Darrick J. Wong" <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>	2024-12-23 13:06:12 -08:00
Darrick J. Wong	e5a171729b	xfs: wire up realtime refcount btree cursors Wire up realtime refcount btree cursors wherever they're needed throughout the code base. Signed-off-by: "Darrick J. Wong" <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>	2024-12-23 13:06:12 -08:00
Christoph Hellwig	4e87047539	xfs: refactor xfs_reflink_find_shared Move lookup of the perag structure from the callers into the helpers, and return the offset into the extent of the shared region instead of the block number that needs post-processing. This prepares the callsites for the creation of an rt-specific variant in the next patch. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: "Darrick J. Wong" <djwong@kernel.org> [djwong: port to the middle of the rtreflink series for cleanliness] Signed-off-by: "Darrick J. Wong" <djwong@kernel.org>	2024-12-23 13:06:12 -08:00
Darrick J. Wong	f0415af60f	xfs: wire up a new metafile type for the realtime refcount Plumb in the pieces we need to embed the root of the realtime refcount btree in an inode's data fork, complete with metafile type and on-disk interpretation functions. Signed-off-by: "Darrick J. Wong" <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>	2024-12-23 13:06:12 -08:00
Darrick J. Wong	bf0b994113	xfs: add metadata reservations for realtime refcount btree Reserve some free blocks so that we will always have enough free blocks in the data volume to handle expansion of the realtime refcount btree. Signed-off-by: "Darrick J. Wong" <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>	2024-12-23 13:06:11 -08:00
Darrick J. Wong	eaed472c40	xfs: add realtime refcount btree inode to metadata directory Add a metadir path to select the realtime refcount btree inode and load it at mount time. The rtrefcountbt inode will have a unique extent format code, which means that we also have to update the inode validation and flush routines to look for it. Signed-off-by: "Darrick J. Wong" <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>	2024-12-23 13:06:11 -08:00
Darrick J. Wong	e08d0f2004	xfs: add realtime refcount btree block detection to log recovery Identify rt refcount btree blocks in the log correctly so that we can validate them during log recovery. Signed-off-by: "Darrick J. Wong" <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>	2024-12-23 13:06:11 -08:00
Darrick J. Wong	ee6d434479	xfs: support recovering refcount intent items targetting realtime extents Now that we have reflink on the realtime device, refcount intent items have to support remapping extents on the realtime volume. Signed-off-by: "Darrick J. Wong" <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>	2024-12-23 13:06:11 -08:00
Darrick J. Wong	fd9300679c	xfs: add a realtime flag to the refcount update log redo items Extend the refcount update (CUI) log items with a new realtime flag that indicates that the updates apply against the realtime refcountbt. We'll wire up the actual refcount code later. Signed-off-by: "Darrick J. Wong" <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>	2024-12-23 13:06:11 -08:00
Darrick J. Wong	01cef1db24	xfs: prepare refcount functions to deal with rtrefcountbt Prepare the high-level refcount functions to deal with the new realtime refcountbt and its slightly different conventions. Provide the ability to talk to either refcountbt or rtrefcountbt formats from the same high level code. Note that we leave the _recover_cow_leftovers functions for a separate patch so that we can convert it all at once. Signed-off-by: "Darrick J. Wong" <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>	2024-12-23 13:06:10 -08:00
Darrick J. Wong	1a6f88ea53	xfs: add realtime refcount btree operations Implement the generic btree operations needed to manipulate rtrefcount btree blocks. This is different from the regular refcountbt in that we allocate space from the filesystem at large, and are neither constrained to the free space nor any particular AG. Signed-off-by: "Darrick J. Wong" <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>	2024-12-23 13:06:10 -08:00
Darrick J. Wong	2003c6a875	xfs: realtime refcount btree transaction reservations Make sure that there's enough log reservation to handle mapping and unmapping realtime extents. We have to reserve enough space to handle a split in the rtrefcountbt to add the record and a second split in the regular refcountbt to record the rtrefcountbt split. Signed-off-by: "Darrick J. Wong" <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>	2024-12-23 13:06:10 -08:00
Darrick J. Wong	9abe03a0e4	xfs: introduce realtime refcount btree ondisk definitions Add the ondisk structure definitions for realtime refcount btrees. The realtime refcount btree will be rooted from a hidden inode so it needs to have a separate btree block magic and pointer format. Next, add everything needed to read, write and manipulate refcount btree blocks. This prepares the way for connecting the btree operations implementation, though the changes to actually root the rtrefcount btree in an inode come later. Signed-off-by: "Darrick J. Wong" <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>	2024-12-23 13:06:10 -08:00
Darrick J. Wong	70fcf68665	xfs: namespace the maximum length/refcount symbols Actually namespace these variables properly, so that readers can tell that this is an XFS symbol, and that it's for the refcount functionality. Signed-off-by: "Darrick J. Wong" <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>	2024-12-23 13:06:10 -08:00
Darrick J. Wong	0d89af530c	xfs: prepare refcount btree cursor tracepoints for realtime Rework the refcount btree cursor tracepoints in preparation to handle the realtime refcount btree cursor. Mostly this involves renaming the field to "refcbno" and extracting the group number from the cursor when possible. Signed-off-by: "Darrick J. Wong" <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>	2024-12-23 13:06:10 -08:00
Darrick J. Wong	c2358439af	xfs: enable realtime rmap btree Permit mounting filesystems with realtime rmap btrees. Signed-off-by: "Darrick J. Wong" <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>	2024-12-23 13:06:09 -08:00
Darrick J. Wong	799e7e6566	xfs: react to fsdax failure notifications on the rt device Now that we have reverse mapping for the realtime device, use the information to kill processes that have mappings to bad pmem. This requires refactoring the existing routines to handle rtgroups or AGs; and splitting out the translation function to improve cohesion. Also make a proper header file for the dax holder ops. Signed-off-by: "Darrick J. Wong" <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>	2024-12-23 13:06:09 -08:00
Darrick J. Wong	f4ed930379	xfs: don't shut down the filesystem for media failures beyond end of log If the filesystem has an external log device on pmem and the pmem reports a media error beyond the end of the log area, don't shut down the filesystem because we don't use that space. Cc: <stable@vger.kernel.org> # v6.0 Fixes: `6f643c57d5` ("xfs: implement ->notify_failure() for XFS") Signed-off-by: "Darrick J. Wong" <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>	2024-12-23 13:06:09 -08:00
Darrick J. Wong	9515572be6	xfs: hook live realtime rmap operations during a repair operation Hook the regular realtime rmap code when an rtrmapbt repair operation is running so that we can unlock the AGF buffer to scan the filesystem and keep the in-memory btree up to date during the scan. Signed-off-by: "Darrick J. Wong" <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>	2024-12-23 13:06:09 -08:00
Darrick J. Wong	4a61f12eb1	xfs: create a shadow rmap btree during realtime rmap repair Create an in-memory btree of rmap records instead of an array. This enables us to do live record collection instead of freezing the fs. Signed-off-by: "Darrick J. Wong" <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>	2024-12-23 13:06:09 -08:00
Darrick J. Wong	6a849bd81b	xfs: online repair of the realtime rmap btree Repair the realtime rmap btree while mounted. Similar to the regular rmap btree repair code, we walk the data fork mappings of every realtime file in the filesystem to collect reverse-mapping records in an xfarray. Then we sort the xfarray, and use the btree bulk loader to create a new rtrmap btree ondisk. Finally, we swap the btree roots, and reap the old blocks in the usual way. Signed-off-by: "Darrick J. Wong" <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>	2024-12-23 13:06:09 -08:00
Darrick J. Wong	c6904f6788	xfs: support repairing metadata btrees rooted in metadir inodes Adapt the repair code so that we can stage a new btree in the data fork area of a metadir inode and reap the old blocks. We already have nearly all of the infrastructure; the only parts that were missing were the metadata inode reservation handling. Signed-off-by: "Darrick J. Wong" <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>	2024-12-23 13:06:08 -08:00
Darrick J. Wong	8defee8dff	xfs: online repair of realtime bitmaps for a realtime group For a given rt group, regenerate the bitmap contents from the group's realtime rmap btree. Signed-off-by: "Darrick J. Wong" <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>	2024-12-23 13:06:08 -08:00
Darrick J. Wong	3dd3aba6b9	xfs: repair rmap btree inodes Teach the inode repair code how to deal with realtime rmap btree inodes that won't load properly. This is most likely moot since the filesystem generally won't mount without the rtrmapbt inodes being usable, but we'll add this for completeness. Signed-off-by: "Darrick J. Wong" <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>	2024-12-23 13:06:08 -08:00
Darrick J. Wong	1bd0843027	xfs: repair inodes that have realtime extents Plumb into the inode core repair code the ability to search for extents on realtime devices. Signed-off-by: "Darrick J. Wong" <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>	2024-12-23 13:06:08 -08:00
Darrick J. Wong	f1a6d9b4c3	xfs: online repair of realtime file bmaps Now that we have a reverse-mapping index of the realtime device, we can rebuild the data fork forward-mappings of any realtime file. Enhance the existing bmbt repair code to walk the rtrmap btrees to gather this information. Signed-off-by: "Darrick J. Wong" <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>	2024-12-23 13:06:08 -08:00
Darrick J. Wong	2e0629e17c	xfs: walk the rt reverse mapping tree when rebuilding rmap When we're rebuilding the data device rmap, if we encounter an "rmap" format fork, we have to walk the (realtime) rmap btree inode to build the appropriate mappings. Signed-off-by: "Darrick J. Wong" <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>	2024-12-23 13:06:08 -08:00
Darrick J. Wong	366243cc99	xfs: scrub the metadir path of rt rmap btree files Add a new XFS_SCRUB_METAPATH subtype so that we can scrub the metadata directory tree path to the rmap btree file for each rt group. Signed-off-by: "Darrick J. Wong" <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>	2024-12-23 13:06:07 -08:00
Darrick J. Wong	a5542712f9	xfs: scan rt rmap when we're doing an intense rmap check of bmbt mappings Teach the bmbt scrubber how to perform a comprehensive check that the rmapbt does not contain /any/ mappings that are not described by bmbt records when it's dealing with a realtime file. Signed-off-by: "Darrick J. Wong" <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>	2024-12-23 13:06:07 -08:00
Darrick J. Wong	037a44d827	xfs: cross-reference the realtime rmapbt Teach the data fork and realtime bitmap scrubbers to cross-reference information with the realtime rmap btree. Signed-off-by: "Darrick J. Wong" <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>	2024-12-23 13:06:07 -08:00
Darrick J. Wong	1ebecab5ad	xfs: cross-reference realtime bitmap to realtime rmapbt scrubber When we're checking the realtime rmap btree entries, cross-reference those entries with the realtime bitmap too. Signed-off-by: "Darrick J. Wong" <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>	2024-12-23 13:06:07 -08:00
Darrick J. Wong	9a6cc4f6d0	xfs: scrub the realtime rmapbt Check the realtime reverse mapping btree against the rtbitmap, and modify the rtbitmap scrub to check against the rtrmapbt. Signed-off-by: "Darrick J. Wong" <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>	2024-12-23 13:06:07 -08:00
Darrick J. Wong	428e488465	xfs: allow queued realtime intents to drain before scrubbing When a writer thread executes a chain of log intent items for the realtime volume, the ILOCKs taken during each step are for each rt metadata file, not the entire rt volume itself. Although scrub takes all rt metadata ILOCKs, this isn't sufficient to guard against scrub checking the rt volume while that writer thread is in the middle of finishing a chain because there's no higher level locking primitive guarding the realtime volume. When there's a collision, cross-referencing between data structures (e.g. rtrmapbt and rtrefcountbt) yields false corruption events; if repair is running, this results in incorrect repairs, which is catastrophic. Fix this by adding to the mount structure the same drain that we use to protect scrub against concurrent AG updates, but this time for the realtime volume. Signed-off-by: "Darrick J. Wong" <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>	2024-12-23 13:06:06 -08:00
Darrick J. Wong	6d4933c221	xfs: report realtime rmap btree corruption errors to the health system Whenever we encounter corrupt realtime rmap btree blocks, we should report that to the health monitoring system for later reporting. Signed-off-by: "Darrick J. Wong" <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>	2024-12-23 13:06:06 -08:00
Darrick J. Wong	59a57acbce	xfs: check that the rtrmapbt maxlevels doesn't increase when growing fs The size of filesystem transaction reservations depends on the maximum height (maxlevels) of the realtime btrees. Since we don't want a grow operation to increase the reservation size enough that we'll fail the minimum log size checks on the next mount, constrain growfs operations if they would cause an increase in those maxlevels. Signed-off-by: "Darrick J. Wong" <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>	2024-12-23 13:06:06 -08:00
Darrick J. Wong	b3683c74bf	xfs: wire up getfsmap to the realtime reverse mapping btree Connect the getfsmap ioctl to the realtime rmapbt. Signed-off-by: "Darrick J. Wong" <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>	2024-12-23 13:06:06 -08:00
Darrick J. Wong	71b8acb42b	xfs: create routine to allocate and initialize a realtime rmap btree inode Create a library routine to allocate and initialize an empty realtime rmapbt inode. We'll use this for mkfs and repair. Signed-off-by: "Darrick J. Wong" <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>	2024-12-23 13:06:06 -08:00
Darrick J. Wong	609a592865	xfs: wire up rmap map and unmap to the realtime rmapbt Connect the map and unmap reverse-mapping operations to the realtime rmapbt via the deferred operation callbacks. This enables us to perform rmap operations against the correct btree. Signed-off-by: "Darrick J. Wong" <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>	2024-12-23 13:06:06 -08:00
Darrick J. Wong	f33659e8a1	xfs: wire up a new metafile type for the realtime rmap Plumb in the pieces we need to embed the root of the realtime rmap btree in an inode's data fork, complete with new metafile type and on-disk interpretation functions. Signed-off-by: "Darrick J. Wong" <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>	2024-12-23 13:06:05 -08:00
Darrick J. Wong	8491a55cfc	xfs: add metadata reservations for realtime rmap btrees Reserve some free blocks so that we will always have enough free blocks in the data volume to handle expansion of the realtime rmap btree. Signed-off-by: "Darrick J. Wong" <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>	2024-12-23 13:06:05 -08:00
Darrick J. Wong	6b08901a6e	xfs: add realtime reverse map inode to metadata directory Add a metadir path to select the realtime rmap btree inode and load it at mount time. The rtrmapbt inode will have a unique extent format code, which means that we also have to update the inode validation and flush routines to look for it. Signed-off-by: "Darrick J. Wong" <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>	2024-12-23 13:06:05 -08:00
Darrick J. Wong	702c90f451	xfs: support file data forks containing metadata btrees Create a new fork format type for metadata btrees. This fork type requires that the inode is in the metadata directory tree, and only applies to the data fork. The actual type of the metadata btree itself is determined by the di_metatype field. Signed-off-by: "Darrick J. Wong" <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>	2024-12-23 13:06:05 -08:00
Darrick J. Wong	219ee99d36	xfs: pretty print metadata file types in error messages Create a helper function to turn a metadata file type code into a printable string, and use this to complain about lockdep problems with rtgroup inodes. We'll use this more in the next patch. Signed-off-by: "Darrick J. Wong" <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>	2024-12-23 13:06:05 -08:00
Darrick J. Wong	5e0679d1c6	xfs: support recovering rmap intent items targetting realtime extents Now that we have rmap on the realtime device and rmap intent items that target the realtime device, log recovery has to support remapping extents on the realtime volume. Make this work. Identify rtrmapbt blocks in the log correctly so that we can validate them during log recovery. Signed-off-by: "Darrick J. Wong" <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>	2024-12-23 13:06:05 -08:00
Darrick J. Wong	9e823fc274	xfs: add a realtime flag to the rmap update log redo items Extend the rmap update (RUI) log items to handle realtime volumes by adding a new log intent item type. Signed-off-by: "Darrick J. Wong" <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>	2024-12-23 13:06:04 -08:00
Darrick J. Wong	adafb31c80	xfs: prepare rmap functions to deal with rtrmapbt Prepare the high-level rmap functions to deal with the new realtime rmapbt and its slightly different conventions. Provide the ability to talk to either rmapbt or rtrmapbt formats from the same high level code. Signed-off-by: "Darrick J. Wong" <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>	2024-12-23 13:06:04 -08:00
Darrick J. Wong	d386b40243	xfs: add realtime rmap btree operations Implement the generic btree operations needed to manipulate rtrmap btree blocks. This is different from the regular rmapbt in that we allocate space from the filesystem at large, and are neither constrained to the free space nor any particular AG. Signed-off-by: "Darrick J. Wong" <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>	2024-12-23 13:06:04 -08:00
Darrick J. Wong	e1c76fce50	xfs: realtime rmap btree transaction reservations Make sure that there's enough log reservation to handle mapping and unmapping realtime extents. We have to reserve enough space to handle a split in the rtrmapbt to add the record and a second split in the regular rmapbt to record the rtrmapbt split. Signed-off-by: "Darrick J. Wong" <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>	2024-12-23 13:06:04 -08:00
Darrick J. Wong	fc6856c6ff	xfs: introduce realtime rmap btree ondisk definitions Add the ondisk structure definitions for realtime rmap btrees. The realtime rmap btree will be rooted from a hidden inode so it needs to have a separate btree block magic and pointer format. Next, add everything needed to read, write and manipulate rmap btree blocks. This prepares the way for connecting the btree operations implementation, though embedding the rtrmap btree root in the inode comes later in the series. Signed-off-by: "Darrick J. Wong" <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>	2024-12-23 13:06:04 -08:00
Darrick J. Wong	05290bd5c6	xfs: allow inode-based btrees to reserve space in the data device Create a new space reservation scheme so that btree metadata for the realtime volume can reserve space in the data device to avoid space underruns. Back when we were testing the rmap and refcount btrees for the data device, people observed occasional shutdowns when xfs_btree_split was called for either of those two btrees. This happened when certain operations (mostly writeback ioends) created new rmap or refcount records, which would expand the size of the btree. If there were no free blocks available the allocation would fail and the split would shut down the filesystem. I considered pre-reserving blocks for btree expansion at the time of a write() call, but there wasn't any good way to attach the reservations to an inode and keep them there all the way to ioend processing. Unlike delalloc reservations which have that indlen mechanism, there's no way to do that for mapped extents; and indlen blocks are given back during the delalloc -> unwritten transition. The solution was to reserve sufficient blocks for rmap/refcount btree expansion at mount time. This is what the XFS_AG_RESV_* flags provide; any expansion of those two btrees can come from the pre-reserved space. This patch brings that pre-reservation ability to inode-rooted btrees so that the rt rmap and refcount btrees can also save room for future expansion. Signed-off-by: "Darrick J. Wong" <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>	2024-12-23 13:06:03 -08:00
Darrick J. Wong	2f63b20b7a	xfs: support storing records in the inode core root Add the necessary flags and code so that we can support storing leaf records in the inode root block of a btree. This hasn't been necessary before, but the realtime rmapbt will need to be able to do this. Signed-off-by: "Darrick J. Wong" <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>	2024-12-23 13:06:03 -08:00
Darrick J. Wong	953f76bf7a	xfs: simplify the xfs_rmap_{alloc,free}_extent calling conventions Simplify the calling conventions by allowing callers to pass a fsbno (xfs_fsblock_t) directly into these functions, since we're just going to set it in a struct anyway. Signed-off-by: "Darrick J. Wong" <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>	2024-12-23 13:06:03 -08:00
Darrick J. Wong	84140a96cf	xfs: prepare to reuse the dquot pointer space in struct xfs_inode Files participating in the metadata directory tree are not accounted to the quota subsystem. Therefore, the i_[ugp]dquot pointers in struct xfs_inode are never used and should always be NULL. In the next patch we want to add a u64 count of fs blocks reserved for metadata btree expansion, but we don't want every inode in the fs to pay the memory price for this feature. The intent is to union those three pointers with the u64 counter, but for that to work we must guard against all access to the dquot pointers for metadata files. Signed-off-by: "Darrick J. Wong" <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>	2024-12-23 13:06:03 -08:00
Darrick J. Wong	d415fb34b4	xfs: prepare rmap btree cursor tracepoints for realtime Rework the rmap btree cursor tracepoints in preparation to handle the realtime rmap btree cursor. Mostly this involves renaming the field to "gbno" and extracting the group number from the cursor. Signed-off-by: "Darrick J. Wong" <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de>	2024-12-23 13:06:03 -08:00

1 2 3 4 5 ...

9386 Commits