linux

Commit Graph

Author	SHA1	Message	Date
Mikulas Patocka	225b2cb640	vdo: omit need_resched() before cond_resched() There's no need to call need_resched() because cond_resched() will do nothing if need_resched() returns false. Reviewed-by: Matthew Sakai <msakai@redhat.com> Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>	2025-07-31 15:39:55 +02:00
Purva Yeshi	487767bff5	md: dm-zoned-target: Initialize return variable r to avoid uninitialized use Fix Smatch-detected error: drivers/md/dm-zoned-target.c:1073 dmz_iterate_devices() error: uninitialized symbol 'r'. Smatch detects a possible use of the uninitialized variable 'r' in dmz_iterate_devices() because if dmz->nr_ddevs is zero, the loop is skipped and 'r' is returned without being set, leading to undefined behavior. Initialize 'r' to 0 before the loop. This ensures that if there are no devices to iterate over, the function still returns a defined value. Signed-off-by: Purva Yeshi <purvayeshi550@gmail.com> Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>	2025-07-31 15:39:55 +02:00
Eric Biggers	bdf253d580	dm-verity: remove support for asynchronous hashes The support for asynchronous hashes in dm-verity has outlived its usefulness. It adds significant code complexity and opportunity for bugs. I don't know of anyone using it in practice. (The original submitter of the code possibly was, but that was 8 years ago.) Data I recently collected for en/decryption shows that using off-CPU crypto "accelerators" is consistently much slower than the CPU (https://lore.kernel.org/r/20250704070322.20692-1-ebiggers@kernel.org/), even on CPUs that lack dedicated cryptographic instructions. Similar results are likely to be seen for hashing. I already removed support for asynchronous hashes from fsverity two years ago, and no one ever complained. Moreover, neither dm-verity, fsverity, nor fscrypt has ever actually used the asynchronous crypto algorithms in a truly asynchronous manner. The lack of interest in such optimizations provides further evidence that it's only the CPU-based crypto that actually matters. Historically, it's also been common for people to forget to enable the optimized SHA-256 code, which could contribute to an off-CPU crypto engine being perceived as more useful than it really is. In 6.16 I fixed that: the optimized SHA-256 code is now enabled by default. Therefore, let's drop the support for asynchronous hashes in dm-verity. Tested with verity-compat-test. Acked-by: Ard Biesheuvel <ardb@kernel.org> Signed-off-by: Eric Biggers <ebiggers@kernel.org> Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>	2025-07-31 15:39:55 +02:00
Li Nan	907a99c314	md: rename recovery_cp to resync_offset 'recovery_cp' was used to represent the progress of sync, but its name contains recovery, which can cause confusion. Replaces 'recovery_cp' with 'resync_offset' for clarity. Signed-off-by: Li Nan <linan122@huawei.com> Link: https://lore.kernel.org/linux-raid/20250722033340.1933388-1-linan666@huaweicloud.com Signed-off-by: Yu Kuai <yukuai3@huawei.com>	2025-07-31 01:26:04 +08:00
Heming Zhao	948b1fe120	md/md-cluster: handle REMOVE message earlier Commit `a1fd37f978` ("md: Don't wait for MD_RECOVERY_NEEDED for HOT_REMOVE_DISK ioctl") introduced a regression in the md_cluster module. (Failed cases 02r1_Manage_re-add & 02r10_Manage_re-add) Consider a 2-node cluster: - node1 set faulty & remove command on a disk. - node2 must correctly update the array metadata. Before `a1fd37f978`, on node1, the delay between msg:METADATA_UPDATED (triggered by faulty) and msg:REMOVE was sufficient for node2 to reload the disk info (written by node1). After `a1fd37f978`, node1 no longer waits between faulty and remove, causing it to send msg:REMOVE while node2 is still reloading disk info. This often results in node2 failing to remove the faulty disk. == how to trigger == set up a 2-node cluster (node1 & node2) with disks vdc & vdd. on node1: mdadm -CR /dev/md0 -l1 -b clustered -n2 /dev/vdc /dev/vdd --assume-clean ssh node2-ip mdadm -A /dev/md0 /dev/vdc /dev/vdd mdadm --manage /dev/md0 --fail /dev/vdc --remove /dev/vdc check array status on both nodes with "mdadm -D /dev/md0". node1 output: Number Major Minor RaidDevice State - 0 0 0 removed 1 254 48 1 active sync /dev/vdd node2 output: Number Major Minor RaidDevice State - 0 0 0 removed 1 254 48 1 active sync /dev/vdd 0 254 32 - faulty /dev/vdc Fixes: `a1fd37f978` ("md: Don't wait for MD_RECOVERY_NEEDED for HOT_REMOVE_DISK ioctl") Signed-off-by: Heming Zhao <heming.zhao@suse.com> Reviewed-by: Su Yue <glass.su@suse.com> Link: https://lore.kernel.org/linux-raid/20250728042145.9989-1-heming.zhao@suse.com Signed-off-by: Yu Kuai <yukuai3@huawei.com>	2025-07-31 01:23:19 +08:00
Yu Kuai	1df1fc845d	md: fix create on open mddev lifetime regression Commit `9e59d60976` ("md: call del_gendisk in control path") moves setting MD_DELETED from __mddev_put() to do_md_stop(), however, for the case create on open, mddev can be freed without do_md_stop(): 1) open md_probe md_alloc_and_put md_alloc mddev_alloc atomic_set(&mddev->active, 1); mddev->hold_active = UNTIL_IOCTL mddev_put atomic_dec_and_test(&mddev->active) if (mddev->hold_active) -> active is 0, hold_active is set md_open mddev_get atomic_inc(&mddev->active); 2) ioctl that is not STOP_ARRAY, for example, GET_ARRAY_INFO: md_ioctl mddev->hold_active = 0 3) close md_release mddev_put(mddev); atomic_dec_and_lock(&mddev->active, &all_mddevs_lock) __mddev_put -> hold_active is cleared, mddev will be freed queue_work(md_misc_wq, &mddev->del_work) Now that MD_DELETED is not set, before mddev is freed by mddev_delayed_delete(), md_open can still succeed and break mddev lifetime, causing mddev->kobj refcount underflow or mddev uaf problem. Fix this problem by setting MD_DELETED before queuing del_work. Reported-by: syzbot+9921e319bd6168140b40@syzkaller.appspotmail.com Closes: https://lore.kernel.org/all/68894408.a00a0220.26d0e1.0012.GAE@google.com/ Reported-by: syzbot+fa3a12519f0d3fd4ec16@syzkaller.appspotmail.com Closes: https://lore.kernel.org/all/68894408.a00a0220.26d0e1.0013.GAE@google.com/ Fixes: `9e59d60976` ("md: call del_gendisk in control path") Link: https://lore.kernel.org/linux-raid/20250730073321.2583158-1-yukuai1@huaweicloud.com Signed-off-by: Yu Kuai <yukuai3@huawei.com> Reviewed-by: Paul Menzel <pmenzel@molgen.mpg.de> Reviewed-by: Xiao Ni <xni@redhat.com>	2025-07-31 01:21:43 +08:00
Linus Torvalds	6e11664f14	for-6.17/block-20250728 -----BEGIN PGP SIGNATURE----- iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmiHdZ8QHGF4Ym9lQGtl cm5lbC5kawAKCRD301j7KXHgptRED/9o3dQ1QHL5yNM/AyCCGox0V4zra8qGS/Vc cBWpAVrmPGRw0IYlLZENtN9PdwKcbMzJq3l6cxeC7dBnAZP0AxTzP4YYJYUNVsqo WtJ3d/k5+cVp0OyOp4uabaqNeMeLoPk9/JXe1Ml2KxtDmHtj5yee0JRh7zlPZmZj tsrpIUTeHgAPn6yR1EI+0ybx/mjCb05Mv2Y8gF5hkUPA2PuON+MTFixJmqoy2ySh n+22mz/prqlyOSYh/VVv1+9jcQ94wMjcW0JIpg9lM3Kg8BCPU4IetvO1UiX6X33v 154zEh2aJJDBx+yORS4BM4JMXjRZI7lYea2dkHM8Cajctu1Wpja9bNwnK9ibXvEc WtyBwztleLbAZef25fA/W87JE23fGa/r3nwIb2cF4QqkAFslCvhjA93WkOzNJCgQ qsWOrlCh3IK2NUu4b1Ncs3ZHOPvc51+zzjMzC6SUr54xhrxDK+gngDPhRy7XDqWJ DTMpIlr366o8GdJqnib0/e/CPBrThS6Vl6u0tgLnNbwdpK1svgo/uHW5ksKvDqHX kGEIhyRRJJC+4wyl4dsYKXa2twcyFrlWdAE+pZguEC2nZRYqYl9uXftOtvfp1x0y /skDX0FIDjvyjRqCLcqF03FSGqwCGS8WuWXZjPhVhcfz47NvbHeFDh1G/jMzsbpj S9zrPve/DQ== =e86T -----END PGP SIGNATURE----- Merge tag 'for-6.17/block-20250728' of git://git.kernel.dk/linux Pull block updates from Jens Axboe: - MD pull request via Yu: - call del_gendisk synchronously (Xiao) - cleanup unused variable (John) - cleanup workqueue flags (Ryo) - fix faulty rdev can't be removed during resync (Qixing) - NVMe pull request via Christoph: - try PCIe function level reset on init failure (Keith Busch) - log TLS handshake failures at error level (Maurizio Lombardi) - pci-epf: do not complete commands twice if nvmet_req_init() fails (Rick Wertenbroek) - misc cleanups (Alok Tiwari) - Removal of the pktcdvd driver This has been more than a decade coming at this point, and some recently revealed breakages that had it causing issues even for cases where it isn't required made me re-pull the trigger on this one. It's known broken and nobody has stepped up to maintain the code - Series for ublk supporting batch commands, enabling the use of multishot where appropriate - Speed up ublk exit handling - Fix for the two-stage elevator fixing which could leak data - Convert NVMe to use the new IOVA based API - Increase default max transfer size to something more reasonable - Series fixing write operations on zoned DM devices - Add tracepoints for zoned block device operations - Prep series working towards improving blk-mq queue management in the presence of isolated CPUs - Don't allow updating of the block size of a loop device that is currently under exclusively ownership/open - Set chunk sectors from stacked device stripe size and use it for the atomic write size limit - Switch to folios in bcache read_super() - Fix for CD-ROM MRW exit flush handling - Various tweaks, fixes, and cleanups * tag 'for-6.17/block-20250728' of git://git.kernel.dk/linux: (94 commits) block: restore two stage elevator switch while running nr_hw_queue update cdrom: Call cdrom_mrw_exit from cdrom_release function sunvdc: Balance device refcount in vdc_port_mpgroup_check nvme-pci: try function level reset on init failure dm: split write BIOs on zone boundaries when zone append is not emulated block: use chunk_sectors when evaluating stacked atomic write limits dm-stripe: limit chunk_sectors to the stripe size md/raid10: set chunk_sectors limit md/raid0: set chunk_sectors limit block: sanitize chunk_sectors for atomic write limits ilog2: add max_pow_of_two_factor() nvmet: pci-epf: Do not complete commands twice if nvmet_req_init() fails nvme-tcp: log TLS handshake failures at error level docs: nvme: fix grammar in nvme-pci-endpoint-target.rst nvme: fix typo in status code constant for self-test in progress nvmet: remove redundant assignment of error code in nvmet_ns_enable() nvme: fix incorrect variable in io cqes error message nvme: fix multiple spelling and grammar issues in host drivers block: fix blk_zone_append_update_request_bio() kernel-doc md/raid10: fix set but not used variable in sync_request_write() ...	2025-07-28 16:43:54 -07:00
Linus Torvalds	cec40a7c80	vfs-6.17-rc1.integrity -----BEGIN PGP SIGNATURE----- iHUEABYKAB0WIQRAhzRXHqcMeLMyaSiRxhvAZXjcogUCaINCngAKCRCRxhvAZXjc ogAMAP9LqNHFf7JfDIvF/PJBxzYa0ToWwPsWACERknwkvtBRCwEAhkmscIcIMQ4t LPGLGha17dfpaE4RurRhBYgS9x2/1Ao= =jSnJ -----END PGP SIGNATURE----- Merge tag 'vfs-6.17-rc1.integrity' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs Pull vfs 'protection info' updates from Christian Brauner: "This adds the new FS_IOC_GETLBMD_CAP ioctl() to query metadata and protection info (PI) capabilities. This ioctl returns information about the files integrity profile. This is useful for userspace applications to understand a files end-to-end data protection support and configure the I/O accordingly. For now this interface is only supported by block devices. However the design and placement of this ioctl in generic FS ioctl space allows us to extend it to work over files as well. This maybe useful when filesystems start supporting PI-aware layouts. A new structure struct logical_block_metadata_cap is introduced, which contains the following fields: - lbmd_flags: bitmask of logical block metadata capability flags - lbmd_interval: the amount of data described by each unit of logical block metadata - lbmd_size: size in bytes of the logical block metadata associated with each interval - lbmd_opaque_size: size in bytes of the opaque block tag associated with each interval - lbmd_opaque_offset: offset in bytes of the opaque block tag within the logical block metadata - lbmd_pi_size: size in bytes of the T10 PI tuple associated with each interval - lbmd_pi_offset: offset in bytes of T10 PI tuple within the logical block metadata - lbmd_pi_guard_tag_type: T10 PI guard tag type - lbmd_pi_app_tag_size: size in bytes of the T10 PI application tag - lbmd_pi_ref_tag_size: size in bytes of the T10 PI reference tag - lbmd_pi_storage_tag_size: size in bytes of the T10 PI storage tag The internal logic to fetch the capability is encapsulated in a helper function blk_get_meta_cap(), which uses the blk_integrity profile associated with the device. The ioctl returns -EOPNOTSUPP, if CONFIG_BLK_DEV_INTEGRITY is not enabled" * tag 'vfs-6.17-rc1.integrity' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: block: fix lbmd_guard_tag_type assignment in FS_IOC_GETLBMD_CAP block: fix FS_IOC_GETLBMD_CAP parsing in blkdev_common_ioctl() fs: add ioctl to query metadata and protection info capabilities nvme: set pi_offset only when checksum type is not BLK_INTEGRITY_CSUM_NONE block: introduce pi_tuple_size field in blk_integrity block: rename tuple_size field in blk_integrity to metadata_size	2025-07-28 15:12:00 -07:00
Linus Torvalds	278c7d9b5e	vfs-6.17-rc1.fallocate -----BEGIN PGP SIGNATURE----- iHUEABYKAB0WIQRAhzRXHqcMeLMyaSiRxhvAZXjcogUCaINCeQAKCRCRxhvAZXjc otqEAP9bWFExQtnzrNR+1s4UBfPVDAaTJzDnBWj6z0+Idw9oegEAoxF2ifdCPnR4 t/xWiM4FmSA+9pwvP3U5z3sOReDDsgo= =WMMB -----END PGP SIGNATURE----- Merge tag 'vfs-6.17-rc1.fallocate' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs Pull fallocate updates from Christian Brauner: "fallocate() currently supports creating preallocated files efficiently. However, on most filesystems fallocate() will preallocate blocks in an unwriten state even if FALLOC_FL_ZERO_RANGE is specified. The extent state must later be converted to a written state when the user writes data into this range, which can trigger numerous metadata changes and journal I/O. This may leads to significant write amplification and performance degradation in synchronous write mode. At the moment, the only method to avoid this is to create an empty file and write zero data into it (for example, using 'dd' with a large block size). However, this method is slow and consumes a considerable amount of disk bandwidth. Now that more and more flash-based storage devices are available it is possible to efficiently write zeros to SSDs using the unmap write zeroes command if the devices do not write physical zeroes to the media. For example, if SCSI SSDs support the UMMAP bit or NVMe SSDs support the DEAC bit[1], the write zeroes command does not write actual data to the device, instead, NVMe converts the zeroed range to a deallocated state, which works fast and consumes almost no disk write bandwidth. This series implements the BLK_FEAT_WRITE_ZEROES_UNMAP feature and BLK_FLAG_WRITE_ZEROES_UNMAP_DISABLED flag for SCSI, NVMe and device-mapper drivers, and add the FALLOC_FL_WRITE_ZEROES and STATX_ATTR_WRITE_ZEROES_UNMAP support for ext4 and raw bdev devices. fallocate() is subsequently extended with the FALLOC_FL_WRITE_ZEROES flag. FALLOC_FL_WRITE_ZEROES zeroes a specified file range in such a way that subsequent writes to that range do not require further changes to the file mapping metadata. This flag is beneficial for subsequent pure overwriting within this range, as it can save on block allocation and, consequently, significant metadata changes" * tag 'vfs-6.17-rc1.fallocate' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: ext4: add FALLOC_FL_WRITE_ZEROES support block: add FALLOC_FL_WRITE_ZEROES support block: factor out common part in blkdev_fallocate() fs: introduce FALLOC_FL_WRITE_ZEROES to fallocate dm: clear unmap write zeroes limits when disabling write zeroes scsi: sd: set max_hw_wzeroes_unmap_sectors if device supports SD_ZERO_*_UNMAP nvmet: set WZDS and DRB if device enables unmap write zeroes operation nvme: set max_hw_wzeroes_unmap_sectors if device supports DEAC bit block: introduce max_{hw\|user}_wzeroes_unmap_sectors to queue limits	2025-07-28 13:36:49 -07:00
Jens Axboe	c20413b799	Merge tag 'md-6.17-20250722' of https://git.kernel.org/pub/scm/linux/kernel/git/mdraid/linux into for-6.17/block Pull MD updates from Yu: "- call del_gendisk synchronously, from Xiao - cleanup unused variable, from John - cleanup workqueue flags, from Ryo - fix faulty rdev can't be removed during resync, from Qixing" * tag 'md-6.17-20250722' of https://git.kernel.org/pub/scm/linux/kernel/git/mdraid/linux: md/raid10: fix set but not used variable in sync_request_write() md: allow removing faulty rdev during resync md/raid5: unset WQ_CPU_INTENSIVE for raid5 unbound workqueue md: remove/add redundancy group only in level change md: Don't clear MD_CLOSING until mddev is freed md: call del_gendisk in control path	2025-07-22 04:48:52 -06:00
Shin'ichiro Kawasaki	675f940576	dm: split write BIOs on zone boundaries when zone append is not emulated Commit `2df7168717` ("dm: Always split write BIOs to zoned device limits") updates the device-mapper driver to perform splits for the write BIOs. However, it did not address the cases where DM targets do not emulate zone append, such as in the cases of dm-linear or dm-flakey. For these targets, when the write BIOs span across zone boundaries, they trigger WARN_ON_ONCE(bio_straddles_zones(bio)) in blk_zone_wplug_handle_write(). This results in I/O errors. The errors are reproduced by running blktests test case zbd/004 using zoned dm-linear or dm-flakey devices. To avoid the I/O errors, handle the write BIOs regardless whether DM targets emulate zone append or not, so that all write BIOs are split at zone boundaries. For that purpose, drop the check for zone append emulation in dm_zone_bio_needs_split(). Its argument 'md' is no longer used then drop it also. Fixes: `2df7168717` ("dm: Always split write BIOs to zoned device limits") Signed-off-by: Shin'ichiro Kawasaki <shinichiro.kawasaki@wdc.com> Reviewed-by: Damien Le Moal <dlemoal@kernel.org> Reviewed-by: Mikulas Patocka <mpatocka@redhat.com> Link: https://lore.kernel.org/r/20250717103539.37279-1-shinichiro.kawasaki@wdc.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2025-07-17 06:03:03 -06:00
John Garry	5fb9d4341b	dm-stripe: limit chunk_sectors to the stripe size Same as done for raid0, set chunk_sectors limit to appropriately set the atomic write size limit. Setting chunk_sectors limit in this way overrides the stacked limit already calculated based on the bottom device limits. This is ok, as when any bios are sent to the bottom devices, the block layer will still respect the bottom device chunk_sectors. Reviewed-by: Nilay Shroff <nilay@linux.ibm.com> Reviewed-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: John Garry <john.g.garry@oracle.com> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Link: https://lore.kernel.org/r/20250711105258.3135198-6-john.g.garry@oracle.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2025-07-17 06:01:16 -06:00
John Garry	7ef50c4c6a	md/raid10: set chunk_sectors limit Same as done for raid0, set chunk_sectors limit to appropriately set the atomic write size limit. Reviewed-by: Nilay Shroff <nilay@linux.ibm.com> Reviewed-by: Yu Kuai <yukuai3@huawei.com> Signed-off-by: John Garry <john.g.garry@oracle.com> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Link: https://lore.kernel.org/r/20250711105258.3135198-5-john.g.garry@oracle.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2025-07-17 06:01:16 -06:00
John Garry	4b8beba60d	md/raid0: set chunk_sectors limit Currently we use min io size as the chunk size when deciding on the atomic write size limits - see blk_stack_atomic_writes_head(). The limit min_io size is not a reliable value to store the chunk size, as this may be mutated by the block stacking code. Such an example would be for the min io size less than the physical block size, and the min io size is raised to the physical block size - see blk_stack_limits(). The block stacking limits will rely on chunk_sectors in future, so set this value (to the chunk size). Reviewed-by: Nilay Shroff <nilay@linux.ibm.com> Reviewed-by: Yu Kuai <yukuai3@huawei.com> Signed-off-by: John Garry <john.g.garry@oracle.com> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Link: https://lore.kernel.org/r/20250711105258.3135198-4-john.g.garry@oracle.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2025-07-17 06:01:16 -06:00
John Garry	bc1c2f0ae3	md/raid10: fix set but not used variable in sync_request_write() Building with W=1 reports the following: drivers/md/raid10.c: In function ‘sync_request_write’: drivers/md/raid10.c:2441:21: error: variable ‘d’ set but not used [-Werror=unused-but-set-variable] 2441 \| int d; \| ^ cc1: all warnings being treated as errors Remove the usage of that variable. Fixes: `752d0464b7` ("md: clean up accounting for issued sync IO") Signed-off-by: John Garry <john.g.garry@oracle.com> Link: https://lore.kernel.org/linux-raid/20250709104814.2307276-1-john.g.garry@oracle.com Signed-off-by: Yu Kuai <yukuai@kernel.org>	2025-07-17 00:02:05 +08:00
Linus Torvalds	155a3c003e	- dm-bufio: fix scheduling in atomic -----BEGIN PGP SIGNATURE----- iIoEABYIADIWIQRnH8MwLyZDhyYfesYTAyx9YGnhbQUCaHU0NhQcbXBhdG9ja2FA cmVkaGF0LmNvbQAKCRATAyx9YGnhbYuhAP9E3m1AlDYfwP1ZOwv0FGXBVtiGFlNw n9HMdwmNBbiMXQD+MxhLAPfly1oot4qUHy7akqK39ANkwlWLDZgpAcI2dA0= =kh23 -----END PGP SIGNATURE----- Merge tag 'for-6.16/dm-fixes-2' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm Pull device mapper fix from Mikulas Patocka: - dm-bufio: fix scheduling in atomic * tag 'for-6.16/dm-fixes-2' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm: dm-bufio: fix sched in atomic context	2025-07-14 19:25:28 -07:00
Zheng Qixing	c0ffeb6480	md: allow removing faulty rdev during resync During RAID resync, faulty rdev cannot be removed and will result in "Device or resource busy" error when attempting hot removal. Reproduction steps: mdadm -Cv /dev/md0 -l1 -n3 -e1.2 /dev/sd{b..d} mdadm /dev/md0 -f /dev/sdb mdadm /dev/md0 -r /dev/sdb -> mdadm: hot remove failed for /dev/sdb: Device or resource busy After commit `4b10a3bc67` ("md: ensure resync is prioritized over recovery"), when a device becomes faulty during resync, the md_choose_sync_action() function returns early without calling remove_and_add_spares(), preventing faulty device removal. This patch extracts a helper function remove_spares() to support removing faulty devices during RAID resync operations. Fixes: `4b10a3bc67` ("md: ensure resync is prioritized over recovery") Signed-off-by: Zheng Qixing <zhengqixing@huawei.com> Reviewed-by: Li Nan <linan122@huawei.com> Link: https://lore.kernel.org/linux-raid/20250707075412.150301-1-zhengqixing@huaweicloud.com Signed-off-by: Yu Kuai <yukuai3@huawei.com>	2025-07-12 17:55:20 +08:00
Ryo Takakura	3ec8db61e7	md/raid5: unset WQ_CPU_INTENSIVE for raid5 unbound workqueue When specified with WQ_CPU_INTENSIVE, the workqueue doesn't participate in concurrency management. This behaviour is already accounted for WQ_UNBOUND workqueues given that they are assigned to their own worker threads. Unset WQ_CPU_INTENSIVE as the use of flag has no effect when used with WQ_UNBOUND. Signed-off-by: Ryo Takakura <ryotkkr98@gmail.com> Acked-by: Tejun Heo <tj@kernel.org> Link: https://lore.kernel.org/linux-raid/20250601013702.64640-1-ryotkkr98@gmail.com Signed-off-by: Yu Kuai <yukuai3@huawei.com>	2025-07-12 17:52:11 +08:00
Xiao Ni	790abe4d77	md: remove/add redundancy group only in level change del_gendisk is called in synchronous way now. So it doesn't need to handle redundancy group in stop path separately. Reviewed-by: Yu Kuai <yukuai3@huawei.com> Signed-off-by: Xiao Ni <xni@redhat.com> Link: https://lore.kernel.org/linux-raid/20250611073108.25463-4-xni@redhat.com Signed-off-by: Yu Kuai <yukuai3@huawei.com>	2025-07-12 17:52:05 +08:00
Xiao Ni	5f286f3355	md: Don't clear MD_CLOSING until mddev is freed UNTIL_STOP is used to avoid mddev is freed on the last close before adding disks to mddev. And it should be cleared when stopping an array which is mentioned in commit `efeb53c0e5` ("md: Allow md devices to be created by name."). So reset ->hold_active to 0 in md_clean. And MD_CLOSING should be kept until mddev is freed to avoid reopen. Reviewed-by: Yu Kuai <yukuai3@huawei.com> Signed-off-by: Xiao Ni <xni@redhat.com> Link: https://lore.kernel.org/linux-raid/20250611073108.25463-3-xni@redhat.com Signed-off-by: Yu Kuai <yukuai3@huawei.com>	2025-07-12 17:51:59 +08:00
Xiao Ni	9e59d60976	md: call del_gendisk in control path Now del_gendisk and put_disk are called asynchronously in workqueue work. The asynchronous way has a problem that the device node can still exist after mdadm --stop command returns in a short window. So udev rule can open this device node and create the struct mddev in kernel again. So put del_gendisk in control path and still leave put_disk in md_kobj_release to avoid uaf of gendisk. Function del_gendisk can't be called with reconfig_mutex. If it's called with reconfig mutex, a deadlock can happen. del_gendisk waits all sysfs files access to finish and sysfs file access waits reconfig mutex. So put del_gendisk after releasing reconfig mutex. But there is still a window that sysfs can be accessed between mddev_unlock and del_gendisk. So some actions (add disk, change level, .e.g) can happen which lead unexpected results. MD_DELETED is used to resolve this problem. MD_DELETED is set before releasing reconfig mutex and it should be checked for these sysfs access which need reconfig mutex. For sysfs access which don't need reconfig mutex, del_gendisk will wait them to finish. But it doesn't need to do this in function mddev_lock_nointr. There are ten places that call it. * Five of them are in dm raid which we don't need to care. MD_DELETED is only used for md raid. * stop_sync_thread, md_do_sync and md_start_sync are related sync request, and it needs to wait sync thread to finish before stopping an array. * md_ioctl: md_open is called before md_ioctl, so ->openers is added. It will fail to stop the array. So it doesn't need to check MD_DELETED here * md_set_readonly: It needs to call mddev_set_closing_and_sync_blockdev when setting readonly or read_auto. So it will fail to stop the array too because MD_CLOSING is already set. Reviewed-by: Yu Kuai <yukuai3@huawei.com> Signed-off-by: Xiao Ni <xni@redhat.com> Link: https://lore.kernel.org/linux-raid/20250611073108.25463-2-xni@redhat.com Signed-off-by: Yu Kuai <yukuai3@huawei.com>	2025-07-12 17:51:54 +08:00
Linus Torvalds	40f92e79b0	block-6.16-20250710 -----BEGIN PGP SIGNATURE----- iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmhwb8QQHGF4Ym9lQGtl cm5lbC5kawAKCRD301j7KXHgpuDTEAC5J4noilx4TRpKQ0gp3cF9KHvB2wAD0ry8 1Y45lccZGrdUyWBnB7KyvIKUHt4MVk5Lw4d3vkv1Shx6XesW35hbCOI2W7UPsMsL nEBYJcroNNKlTlx9TJazVs0xmjF6G7JwaYXD6CVNLkjAQXxdeGst2Or15vhD4soz 3nmwFAyP3sEU7ESRNZ53UaNaM2KW0BBNef+jcFn9MOdSZcilePY7ckh74JzCc9Oc GIcH0eTRDdfPi3TteLu/2VMNjpogX+9LY41r3laSKwSgEcYmj+pPFLuqjU6A82hg dT8FWJLR+GuUWTs9B7FuWWpmk7uwOPrIadSQo2DcTdiBSvBYuGv+0BPIxq1kfykn cUjresj49q2hNAjBK71iEDycZR+W+pn864r1mJg+8pASoKKyNX2/3iTvQj57RwFO phoICyxr37WxCYQMcTXYwPcYD8BnF7mTJIDYDFti4BY1w/dUwlSvsbfI9Zk1rxIH VumZzML0nhfTbEsq+QVOTZ6bq0hn71EmVONLamM1LdaoAh6PZU2CMduJuPjEgCNz I73xzc4MlshOZYBidiq++1yFnRX64pB6jPi2omu31PjXMd1ZZaZWENh0OGFp4zHX 8yCmJoWTs8BXy5v74tJWxMTvShfMlqFBuTQlRexh1kala4IdngQrZjO6vBU2pw2C 4orH43oFdA== =m0St -----END PGP SIGNATURE----- Merge tag 'block-6.16-20250710' of git://git.kernel.dk/linux Pull block fixes from Jens Axboe: - MD changes via Yu: - fix UAF due to stack memory used for bio mempool (Jinchao) - fix raid10/raid1 nowait IO error path (Nigel and Qixing) - fix kernel crash from reading bitmap sysfs entry (Håkon) - Fix for a UAF in the nbd connect error path - Fix for blocksize being bigger than pagesize, if THP isn't enabled * tag 'block-6.16-20250710' of git://git.kernel.dk/linux: block: reject bs > ps block devices when THP is disabled nbd: fix uaf in nbd_genl_connect() error path md/md-bitmap: fix GPF in bitmap_get_stats() md/raid1,raid10: strip REQ_NOWAIT from member bios raid10: cleanup memleak at raid10_make_request md/raid1: Fix stack memory use after return in raid1_reshape	2025-07-11 10:35:54 -07:00
Sheng Yong	b1bf1a782f	dm-bufio: fix sched in atomic context If "try_verify_in_tasklet" is set for dm-verity, DM_BUFIO_CLIENT_NO_SLEEP is enabled for dm-bufio. However, when bufio tries to evict buffers, there is a chance to trigger scheduling in spin_lock_bh, the following warning is hit: BUG: sleeping function called from invalid context at drivers/md/dm-bufio.c:2745 in_atomic(): 1, irqs_disabled(): 0, non_block: 0, pid: 123, name: kworker/2:2 preempt_count: 201, expected: 0 RCU nest depth: 0, expected: 0 4 locks held by kworker/2:2/123: #0: ffff88800a2d1548 ((wq_completion)dm_bufio_cache){....}-{0:0}, at: process_one_work+0xe46/0x1970 #1: ffffc90000d97d20 ((work_completion)(&dm_bufio_replacement_work)){....}-{0:0}, at: process_one_work+0x763/0x1970 #2: ffffffff8555b528 (dm_bufio_clients_lock){....}-{3:3}, at: do_global_cleanup+0x1ce/0x710 #3: ffff88801d5820b8 (&c->spinlock){....}-{2:2}, at: do_global_cleanup+0x2a5/0x710 Preemption disabled at: [<0000000000000000>] 0x0 CPU: 2 UID: 0 PID: 123 Comm: kworker/2:2 Not tainted 6.16.0-rc3-g90548c634bd0 #305 PREEMPT(voluntary) Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.16.3-0-ga6ed6b701f0a-prebuilt.qemu.org 04/01/2014 Workqueue: dm_bufio_cache do_global_cleanup Call Trace: <TASK> dump_stack_lvl+0x53/0x70 __might_resched+0x360/0x4e0 do_global_cleanup+0x2f5/0x710 process_one_work+0x7db/0x1970 worker_thread+0x518/0xea0 kthread+0x359/0x690 ret_from_fork+0xf3/0x1b0 ret_from_fork_asm+0x1a/0x30 </TASK> That can be reproduced by: veritysetup format --data-block-size=4096 --hash-block-size=4096 /dev/vda /dev/vdb SIZE=$(blockdev --getsz /dev/vda) dmsetup create myverity -r --table "0 $SIZE verity 1 /dev/vda /dev/vdb 4096 4096 <data_blocks> 1 sha256 <root_hash> <salt> 1 try_verify_in_tasklet" mount /dev/dm-0 /mnt -o ro echo 102400 > /sys/module/dm_bufio/parameters/max_cache_size_bytes [read files in /mnt] Cc: stable@vger.kernel.org # v6.4+ Fixes: `450e8dee51` ("dm bufio: improve concurrent IO performance") Signed-off-by: Wang Shuai <wangshuai12@xiaomi.com> Signed-off-by: Sheng Yong <shengyong1@xiaomi.com> Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>	2025-07-10 16:48:50 +02:00
Alistair Popple	21aa65bf82	mm: remove callers of pfn_t functionality All PFN_* pfn_t flags have been removed. Therefore there is no longer a need for the pfn_t type and all uses can be replaced with normal pfns. Link: https://lkml.kernel.org/r/bbedfa576c9822f8032494efbe43544628698b1f.1750323463.git-series.apopple@nvidia.com Signed-off-by: Alistair Popple <apopple@nvidia.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Jason Gunthorpe <jgg@nvidia.com> Acked-by: David Hildenbrand <david@redhat.com> Cc: Balbir Singh <balbirs@nvidia.com> Cc: Björn Töpel <bjorn@kernel.org> Cc: Björn Töpel <bjorn@rivosinc.com> Cc: Chunyan Zhang <zhang.lyra@gmail.com> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Deepak Gupta <debug@rivosinc.com> Cc: Gerald Schaefer <gerald.schaefer@linux.ibm.com> Cc: Inki Dae <m.szyprowski@samsung.com> Cc: John Groves <john@groves.net> Cc: John Hubbard <jhubbard@nvidia.com> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Will Deacon <will@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2025-07-09 22:42:19 -07:00
Håkon Bugge	c17fb542db	md/md-bitmap: fix GPF in bitmap_get_stats() The commit message of commit `6ec1f02394` ("md/md-bitmap: fix stats collection for external bitmaps") states: Remove the external bitmap check as the statistics should be available regardless of bitmap storage location. Return -EINVAL only for invalid bitmap with no storage (neither in superblock nor in external file). But, the code does not adhere to the above, as it does only check for a valid super-block for "internal" bitmaps. Hence, we observe: Oops: GPF, probably for non-canonical address 0x1cd66f1f40000028 RIP: 0010:bitmap_get_stats+0x45/0xd0 Call Trace: seq_read_iter+0x2b9/0x46a seq_read+0x12f/0x180 proc_reg_read+0x57/0xb0 vfs_read+0xf6/0x380 ksys_read+0x6d/0xf0 do_syscall_64+0x8c/0x1b0 entry_SYSCALL_64_after_hwframe+0x76/0x7e We fix this by checking the existence of a super-block for both the internal and external case. Fixes: `6ec1f02394` ("md/md-bitmap: fix stats collection for external bitmaps") Cc: stable@vger.kernel.org Reported-by: Gerald Gibson <gerald.gibson@oracle.com> Signed-off-by: Håkon Bugge <haakon.bugge@oracle.com> Link: https://lore.kernel.org/linux-raid/20250702091035.2061312-1-haakon.bugge@oracle.com Signed-off-by: Yu Kuai <yukuai3@huawei.com>	2025-07-05 19:36:50 +08:00
Zheng Qixing	5fa31c4992	md/raid1,raid10: strip REQ_NOWAIT from member bios RAID layers don't implement proper non-blocking semantics for REQ_NOWAIT, making the flag potentially misleading when propagated to member disks. This patch clear REQ_NOWAIT from cloned bios in raid1/raid10. Retain original bio's REQ_NOWAIT flag for upper layer error handling. Maybe we can implement non-blocking I/O handling mechanisms within RAID in future work. Fixes: `9f346f7d4e` ("md/raid1,raid10: don't handle IO error for REQ_RAHEAD and REQ_NOWAIT") Signed-off-by: Zheng Qixing <zhengqixing@huawei.com> Link: https://lore.kernel.org/linux-raid/20250702102341.1969154-1-zhengqixing@huaweicloud.com Signed-off-by: Yu Kuai <yukuai3@huawei.com>	2025-07-05 19:33:46 +08:00
Nigel Croxon	43806c3d5b	raid10: cleanup memleak at raid10_make_request If raid10_read_request or raid10_write_request registers a new request and the REQ_NOWAIT flag is set, the code does not free the malloc from the mempool. unreferenced object 0xffff8884802c3200 (size 192): comm "fio", pid 9197, jiffies 4298078271 hex dump (first 32 bytes): 00 00 00 00 00 00 00 00 88 41 02 00 00 00 00 00 .........A...... 08 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................ backtrace (crc c1a049a2): __kmalloc+0x2bb/0x450 mempool_alloc+0x11b/0x320 raid10_make_request+0x19e/0x650 [raid10] md_handle_request+0x3b3/0x9e0 __submit_bio+0x394/0x560 __submit_bio_noacct+0x145/0x530 submit_bio_noacct_nocheck+0x682/0x830 __blkdev_direct_IO_async+0x4dc/0x6b0 blkdev_read_iter+0x1e5/0x3b0 __io_read+0x230/0x1110 io_read+0x13/0x30 io_issue_sqe+0x134/0x1180 io_submit_sqes+0x48c/0xe90 __do_sys_io_uring_enter+0x574/0x8b0 do_syscall_64+0x5c/0xe0 entry_SYSCALL_64_after_hwframe+0x76/0x7e V4: changing backing tree to see if CKI tests will pass. The patch code has not changed between any versions. Fixes: `c9aa889b03` ("md: raid10 add nowait support") Signed-off-by: Nigel Croxon <ncroxon@redhat.com> Link: https://lore.kernel.org/linux-raid/c0787379-9caa-42f3-b5fc-369aed784400@redhat.com Signed-off-by: Yu Kuai <yukuai3@huawei.com>	2025-07-05 19:30:41 +08:00
Wang Jinchao	d67ed2ccd2	md/raid1: Fix stack memory use after return in raid1_reshape In the raid1_reshape function, newpool is allocated on the stack and assigned to conf->r1bio_pool. This results in conf->r1bio_pool.wait.head pointing to a stack address. Accessing this address later can lead to a kernel panic. Example access path: raid1_reshape() { // newpool is on the stack mempool_t newpool, oldpool; // initialize newpool.wait.head to stack address mempool_init(&newpool, ...); conf->r1bio_pool = newpool; } raid1_read_request() or raid1_write_request() { alloc_r1bio() { mempool_alloc() { // if pool->alloc fails remove_element() { --pool->curr_nr; } } } } mempool_free() { if (pool->curr_nr < pool->min_nr) { // pool->wait.head is a stack address // wake_up() will try to access this invalid address // which leads to a kernel panic return; wake_up(&pool->wait); } } Fix: reinit conf->r1bio_pool.wait after assigning newpool. Fixes: `afeee514ce` ("md: convert to bioset_init()/mempool_init()") Signed-off-by: Wang Jinchao <wangjinchao600@gmail.com> Reviewed-by: Yu Kuai <yukuai3@huawei.com> Link: https://lore.kernel.org/linux-raid/20250612112901.3023950-1-wangjinchao600@gmail.com Signed-off-by: Yu Kuai <yukuai3@huawei.com>	2025-07-05 19:17:37 +08:00
Matthew Wilcox (Oracle)	39107ccbc6	bcache: switch from pages to folios in read_super() Retrieve a folio from the page cache instead of a page. Removes a hidden call to compound_head(). Then be sure to call folio_put() instead of put_page() to release it. That doesn't save any calls to compound_head(), just moves them around. Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Signed-off-by: Coly Li <colyli@kernel.org> Acked-back: Coly Li <colyli@kernel.org> Link: https://lore.kernel.org/r/20250702024848.343370-1-colyli@kernel.org [axboe: commit message massaging] Signed-off-by: Jens Axboe <axboe@kernel.dk>	2025-07-02 18:56:42 -06:00
Anuj Gupta	c6603b1d65	block: rename tuple_size field in blk_integrity to metadata_size The tuple_size field in blk_integrity currently represents the total size of metadata associated with each data interval. To make the meaning more explicit, rename tuple_size to metadata_size. This is a purely mechanical rename with no functional changes. Suggested-by: Martin K. Petersen <martin.petersen@oracle.com> Signed-off-by: Anuj Gupta <anuj20.g@samsung.com> Link: https://lore.kernel.org/20250630090548.3317-2-anuj20.g@samsung.com Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Signed-off-by: Christian Brauner <brauner@kernel.org>	2025-07-01 14:00:14 +02:00
Damien Le Moal	409f9287da	dm: Check for forbidden splitting of zone write operations DM targets must not split zone append and write operations using dm_accept_partial_bio() as doing so is forbidden for zone append BIOs, breaks zone append emulation using regular write BIOs and potentially creates deadlock situations with queue freeze operations. Modify dm_accept_partial_bio() to add missing BUG_ON() checks for all these cases, that is, check that the BIO is a write or write zeroes operation. This change packs all the zone related checks together under a static_branch_unlikely(&zoned_enabled) and done only if the target is a zoned device. Fixes: `f211268ed1` ("dm: Use the block layer zone append emulation") Cc: stable@vger.kernel.org Signed-off-by: Damien Le Moal <dlemoal@kernel.org> Reviewed-by: Mikulas Patocka <mpatocka@redhat.com> Link: https://lore.kernel.org/r/20250625093327.548866-6-dlemoal@kernel.org Signed-off-by: Jens Axboe <axboe@kernel.dk>	2025-06-30 15:50:32 -06:00
Damien Le Moal	e549663849	dm: dm-crypt: Do not partially accept write BIOs with zoned targets Read and write operations issued to a dm-crypt target may be split according to the dm-crypt internal limits defined by the max_read_size and max_write_size module parameters (default is 128 KB). The intent is to improve processing time of large BIOs by splitting them into smaller operations that can be parallelized on different CPUs. For zoned dm-crypt targets, this BIO splitting is still done but without the parallel execution to ensure that the issuing order of write operations to the underlying devices remains sequential. However, the splitting itself causes other problems: 1) Since dm-crypt relies on the block layer zone write plugging to handle zone append emulation using regular write operations, the reminder of a split write BIO will always be plugged into the target zone write plugged. Once the on-going write BIO finishes, this reminder BIO is unplugged and issued from the zone write plug work. If this reminder BIO itself needs to be split, the reminder will be re-issued and plugged again, but that causes a call to a blk_queue_enter(), which may block if a queue freeze operation was initiated. This results in a deadlock as DM submission still holds BIOs that the queue freeze side is waiting for. 2) dm-crypt relies on the emulation done by the block layer using regular write operations for processing zone append operations. This still requires to properly return the written sector as the BIO sector of the original BIO. However, this can be done correctly only and only if there is a single clone BIO used for processing the original zone append operation issued by the user. If the size of a zone append operation is larger than dm-crypt max_write_size, then the orginal BIO will be split and processed as a chain of regular write operations. Such chaining result in an incorrect written sector being returned to the zone append issuer using the original BIO sector. This in turn results in file system data corruptions using xfs or btrfs. Fix this by modifying get_max_request_size() to always return the size of the BIO to avoid it being split with dm_accpet_partial_bio() in crypt_map(). get_max_request_size() is renamed to get_max_request_sectors() to clarify the unit of the value returned and its interface is changed to take a struct dm_target pointer and a pointer to the struct bio being processed. In addition to this change, to ensure that crypt_alloc_buffer() works correctly, set the dm-crypt device max_hw_sectors limit to be at most BIO_MAX_VECS << PAGE_SECTORS_SHIFT (1 MB with a 4KB page architecture). This forces DM core to split write BIOs before passing them to crypt_map(), and thus guaranteeing that dm-crypt can always accept an entire write BIO without needing to split it. This change does not have any effect on the read path of dm-crypt. Read operations can still be split and the BIO fragments processed in parallel. There is also no impact on the performance of the write path given that all zone write BIOs were already processed inline instead of in parallel. This change also does not affect in any way regular dm-crypt block devices. Fixes: `f211268ed1` ("dm: Use the block layer zone append emulation") Cc: stable@vger.kernel.org Signed-off-by: Damien Le Moal <dlemoal@kernel.org> Reviewed-by: Mikulas Patocka <mpatocka@redhat.com> Link: https://lore.kernel.org/r/20250625093327.548866-5-dlemoal@kernel.org Signed-off-by: Jens Axboe <axboe@kernel.dk>	2025-06-30 15:50:32 -06:00
Damien Le Moal	2df7168717	dm: Always split write BIOs to zoned device limits Any zoned DM target that requires zone append emulation will use the block layer zone write plugging. In such case, DM target drivers must not split BIOs using dm_accept_partial_bio() as doing so can potentially lead to deadlocks with queue freeze operations. Regular write operations used to emulate zone append operations also cannot be split by the target driver as that would result in an invalid writen sector value return using the BIO sector. In order for zoned DM target drivers to avoid such incorrect BIO splitting, we must ensure that large BIOs are split before being passed to the map() function of the target, thus guaranteeing that the limits for the mapped device are not exceeded. dm-crypt and dm-flakey are the only target drivers supporting zoned devices and using dm_accept_partial_bio(). In the case of dm-crypt, this function is used to split BIOs to the internal max_write_size limit (which will be suppressed in a different patch). However, since crypt_alloc_buffer() uses a bioset allowing only up to BIO_MAX_VECS (256) vectors in a BIO. The dm-crypt device max_segments limit, which is not set and so default to BLK_MAX_SEGMENTS (128), must thus be respected and write BIOs split accordingly. In the case of dm-flakey, since zone append emulation is not required, the block layer zone write plugging is not used and no splitting of BIOs required. Modify the function dm_zone_bio_needs_split() to use the block layer helper function bio_needs_zone_write_plugging() to force a call to bio_split_to_limits() in dm_split_and_process_bio(). This allows DM target drivers to avoid using dm_accept_partial_bio() for write operations on zoned DM devices. Fixes: `f211268ed1` ("dm: Use the block layer zone append emulation") Cc: stable@vger.kernel.org Signed-off-by: Damien Le Moal <dlemoal@kernel.org> Reviewed-by: Mikulas Patocka <mpatocka@redhat.com> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Link: https://lore.kernel.org/r/20250625093327.548866-4-dlemoal@kernel.org Signed-off-by: Jens Axboe <axboe@kernel.dk>	2025-06-30 15:50:31 -06:00
Damien Le Moal	f70291411b	block: Introduce bio_needs_zone_write_plugging() In preparation for fixing device mapper zone write handling, introduce the inline helper function bio_needs_zone_write_plugging() to test if a BIO requires handling through zone write plugging using the function blk_zone_plug_bio(). This function returns true for any write (op_is_write(bio) == true) operation directed at a zoned block device using zone write plugging, that is, a block device with a disk that has a zone write plug hash table. This helper allows simplifying the check on entry to blk_zone_plug_bio() and used in to protect calls to it for blk-mq devices and DM devices. Fixes: `f211268ed1` ("dm: Use the block layer zone append emulation") Cc: stable@vger.kernel.org Signed-off-by: Damien Le Moal <dlemoal@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20250625093327.548866-3-dlemoal@kernel.org Signed-off-by: Jens Axboe <axboe@kernel.dk>	2025-06-30 15:50:31 -06:00
Mikulas Patocka	6e11952a6a	dm-mpath: don't print the "loaded" message if registering fails If dm_register_path_selector, don't print the "version X loaded" message. Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>	2025-06-30 16:20:17 +02:00
Mikulas Patocka	f86272350f	dm-mpath: make dm_unregister_path_selector return void dm_unregister_path_selector may only return error if there's a bug in the code - so we make it return void and print a warning if the user abuses this function to unregister a target that was not registered. Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>	2025-06-30 16:20:04 +02:00
Dmitry Antipov	ebbd17695e	dm: ima: avoid extra calls to strlen() Since 'scnprintf()' returns the number of characters emitted (not including the trailing '\0'), use that return value instead of the subsequent calls to 'strlen()' where appropriate. Compile tested only. Signed-off-by: Dmitry Antipov <dmantipov@yandex.ru> Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>	2025-06-27 13:07:17 +02:00
Damien Le Moal	548d88f74e	dm: Simplify dm_io_complete() The local variable first_requeue is not needed since it is always equal to dm_io_flagged(io, DM_IO_WAS_SPLIT). Call __dm_io_complete() passing this value directly and remove first_requeue. Also declare dm_io_complete() as inline to make sure it is inlined in its single call site, thus avoiding the cost of a function call. Signed-off-by: Damien Le Moal <dlemoal@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>	2025-06-27 12:59:22 +02:00
Damien Le Moal	d142643c06	dm: Remove unnecessary return in dm_zone_endio() The return statement at the end of dm_zone_endio() is not needed. Remove it. Signed-off-by: Damien Le Moal <dlemoal@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>	2025-06-27 12:59:11 +02:00
Linus Torvalds	78f4e737a5	- dm-crypt: fix a crash on 32-bit machines - dm-raid: replace "rdev" with correct loop variable name "r" -----BEGIN PGP SIGNATURE----- iIoEABYIADIWIQRnH8MwLyZDhyYfesYTAyx9YGnhbQUCaFl3iRQcbXBhdG9ja2FA cmVkaGF0LmNvbQAKCRATAyx9YGnhbcRMAP92ueTp0NFJr9dJne79HbhpJkBAS+b+ 25/qycKPv2XDfwD/c3/e3sBOhTIK8PohFR7lR62NepdfrOFVaaKubmNUlAU= =FD8P -----END PGP SIGNATURE----- Merge tag 'for-6.16/dm-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm Pull device mapper fixes from Mikulas Patocka: - dm-crypt: fix a crash on 32-bit machines - dm-raid: replace "rdev" with correct loop variable name "r" * tag 'for-6.16/dm-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm: dm-raid: fix variable in journal device check dm-crypt: Extend state buffer size in crypt_iv_lmk_one	2025-06-23 15:02:57 -07:00
Heinz Mauelshagen	db53805156	dm-raid: fix variable in journal device check Replace "rdev" with correct loop variable name "r". Signed-off-by: Heinz Mauelshagen <heinzm@redhat.com> Cc: stable@vger.kernel.org Fixes: `63c32ed4af` ("dm raid: add raid4/5/6 journaling support") Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>	2025-06-23 16:42:37 +02:00
Herbert Xu	b872f562c8	dm-crypt: Extend state buffer size in crypt_iv_lmk_one Add a macro CRYPTO_MD5_STATESIZE for the Crypto API export state size of md5 and use that in dm-crypt instead of relying on the size of struct md5_state (the latter is currently undergoing a transition and may shrink). This commit fixes a crash on 32-bit machines: Oops: Oops: 0000 [#1] SMP CPU: 1 UID: 0 PID: 12 Comm: kworker/u16:0 Not tainted 6.16.0-rc2+ #993 PREEMPT(full) Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 11/12/2020 Workqueue: kcryptd-254:0-1 kcryptd_crypt [dm_crypt] EIP: __crypto_shash_export+0xf/0x90 Code: 4a c1 c7 40 20 a0 b4 4a c1 81 cf 0e 00 04 08 89 78 50 e9 2b ff ff ff 8d 74 26 00 55 89 e5 57 56 53 89 c3 89 d6 8b 00 8b 40 14 <8b> 50 fc f6 40 13 01 74 04 4a 2b 50 14 85 c9 74 10 89 f2 89 d8 ff EAX: 303a3435 EBX: c3007c90 ECX: 00000000 EDX: c3007c38 ESI: c3007c38 EDI: c3007c90 EBP: c3007bfc ESP: c3007bf0 DS: 007b ES: 007b FS: 00d8 GS: 0000 SS: 0068 EFLAGS: 00010216 CR0: 80050033 CR2: 303a3431 CR3: 04fbe000 CR4: 00350e90 Call Trace: crypto_shash_export+0x65/0xc0 crypt_iv_lmk_one+0x106/0x1a0 [dm_crypt] Fixes: `efd62c8552` ("crypto: md5-generic - Use API partial block handling") Reported-by: Milan Broz <gmazyland@gmail.com> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au> Tested-by: Milan Broz <gmazyland@gmail.com> Closes: https://lore.kernel.org/linux-crypto/f1625ddc-e82e-4b77-80c2-dc8e45b54848@gmail.com/T/ Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>	2025-06-23 13:50:02 +02:00
Zhang Yi	2c46eab8da	dm: clear unmap write zeroes limits when disabling write zeroes The unmap write zeroes limits have been set to the stacking queue limits by default in blk_set_stacking_limits() and blk_stack_limits(), but it should be cleared if any underlying device does not support it. Signed-off-by: Zhang Yi <yi.zhang@huawei.com> Link: https://lore.kernel.org/20250619111806.3546162-6-yi.zhang@huaweicloud.com Reviewed-by: "Martin K. Petersen" <martin.petersen@oracle.com> Signed-off-by: Christian Brauner <brauner@kernel.org>	2025-06-23 12:45:13 +02:00
Heinz Mauelshagen	9de4a3967c	dm raid: add support for resync w/o metadata devices Target does not honour the "sync" argument when activated w/o metadata devices, e.g. with table line: "0 $(blockdev --getsz $data1) raid raid1 2 0 sync 2 - $data1 - $data2". Fix this to support temporary, transient raid devices useful for data duplication. Signed-off-by: Heinz Mauelshagen <heinzm@redhat.com> Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>	2025-06-23 12:35:37 +02:00
Kent Overstreet	75227ed681	dm-flakey: Fix corrupt_bio_byte setup checks Fix the error_reads mode - it's incompatible with corrupt_bio_byte, but that's only enabled if corrupt_bio_byte is nonzero. Cc: Benjamin Marzinski <bmarzins@redhat.com> Cc: Mikulas Patocka <mpatocka@redhat.com> Cc: Mike Snitzer <snitzer@kernel.org> Cc: dm-devel@lists.linux.dev Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev> Reviewed-by: Benjamin Marzinski <bmarzins@redhat.com> Fixes: `19da6b2c9e` ("dm-flakey: Clean up parsing messages") Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>	2025-06-23 12:04:52 +02:00
Benjamin Marzinski	8ca719b819	dm-table: fix checking for rq stackable devices Due to the semantics of iterate_devices(), the current code allows a request-based dm table as long as it includes one request-stackable device. It is supposed to only allow tables where there are no non-request-stackable devices. Signed-off-by: Benjamin Marzinski <bmarzins@redhat.com> Reviewed-by: Mike Snitzer <snitzer@kernel.org> Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>	2025-06-23 11:56:51 +02:00
Kuan-Wei Chiu	95b2e31e17	bcache: remove unnecessary select MIN_HEAP After reverting the transition to the generic min heap library, bcache no longer depends on MIN_HEAP. The select entry can be removed to reduce code size and shrink the kernel's attack surface. This change effectively reverts the bcache-related part of commit `92a8b224b8` ("lib/min_heap: introduce non-inline versions of min heap API functions"). This is part of a series of changes to address a performance regression caused by the use of the generic min_heap implementation. As reported by Robert, bcache now suffers from latency spikes, with P100 (max) latency increasing from 600 ms to 2.4 seconds every 5 minutes. These regressions degrade bcache's effectiveness as a low-latency cache layer and lead to frequent timeouts and application stalls in production environments. Link: https://lore.kernel.org/lkml/CAJhEC05+0S69z+3+FB2Cd0hD+pCRyWTKLEOsc8BOmH73p1m+KQ@mail.gmail.com Link: https://lkml.kernel.org/r/20250614202353.1632957-4-visitorckw@gmail.com Fixes: `866898efbb` ("bcache: remove heap-related macros and switch to generic min_heap") Fixes: `92a8b224b8` ("lib/min_heap: introduce non-inline versions of min heap API functions") Signed-off-by: Kuan-Wei Chiu <visitorckw@gmail.com> Reported-by: Robert Pang <robertpang@google.com> Closes: https://lore.kernel.org/linux-bcache/CAJhEC06F_AtrPgw2-7CvCqZgeStgCtitbD-ryuPpXQA-JG5XXw@mail.gmail.com Acked-by: Coly Li <colyli@kernel.org> Cc: Ching-Chun (Jim) Huang <jserv@ccns.ncku.edu.tw> Cc: Kent Overstreet <kent.overstreet@linux.dev> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2025-06-19 20:48:03 -07:00
Kuan-Wei Chiu	48fd7ebe00	Revert "bcache: remove heap-related macros and switch to generic min_heap" This reverts commit `866898efbb`. The generic bottom-up min_heap implementation causes performance regression in invalidate_buckets_lru(), a hot path in bcache. Before the cache is fully populated, new_bucket_prio() often returns zero, leading to many equal comparisons. In such cases, bottom-up sift_down performs up to 2 * log2(n) comparisons, while the original top-down approach completes with just O() comparisons, resulting in a measurable performance gap. The performance degradation is further worsened by the non-inlined min_heap API functions introduced in commit `92a8b224b8` ("lib/min_heap: introduce non-inline versions of min heap API functions"), adding function call overhead to this critical path. As reported by Robert, bcache now suffers from latency spikes, with P100 (max) latency increasing from 600 ms to 2.4 seconds every 5 minutes. These regressions degrade bcache's effectiveness as a low-latency cache layer and lead to frequent timeouts and application stalls in production environments. This revert aims to restore bcache's original low-latency behavior. Link: https://lore.kernel.org/lkml/CAJhEC05+0S69z+3+FB2Cd0hD+pCRyWTKLEOsc8BOmH73p1m+KQ@mail.gmail.com Link: https://lkml.kernel.org/r/20250614202353.1632957-3-visitorckw@gmail.com Fixes: `866898efbb` ("bcache: remove heap-related macros and switch to generic min_heap") Fixes: `92a8b224b8` ("lib/min_heap: introduce non-inline versions of min heap API functions") Signed-off-by: Kuan-Wei Chiu <visitorckw@gmail.com> Reported-by: Robert Pang <robertpang@google.com> Closes: https://lore.kernel.org/linux-bcache/CAJhEC06F_AtrPgw2-7CvCqZgeStgCtitbD-ryuPpXQA-JG5XXw@mail.gmail.com Acked-by: Coly Li <colyli@kernel.org> Cc: Ching-Chun (Jim) Huang <jserv@ccns.ncku.edu.tw> Cc: Kent Overstreet <kent.overstreet@linux.dev> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2025-06-19 20:48:03 -07:00
Kuan-Wei Chiu	845f1f2d69	Revert "bcache: update min_heap_callbacks to use default builtin swap" Patch series "bcache: Revert min_heap migration due to performance regression". This patch series reverts the migration of bcache from its original heap implementation to the generic min_heap library. While the original change aimed to simplify the code and improve maintainability, it introduced a severe performance regression in real-world scenarios. As reported by Robert, systems using bcache now suffer from periodic latency spikes, with P100 (max) latency increasing from 600 ms to 2.4 seconds every 5 minutes. This degrades bcache's value as a low-latency caching layer, and leads to frequent timeouts and application stalls in production environments. The primary cause of this regression is the behavior of the generic min_heap implementation's bottom-up sift_down, which performs up to 2 * log2(n) comparisons when many elements are equal. The original top-down variant used by bcache only required O(1) comparisons in such cases. The issue was further exacerbated by commit `92a8b224b8` ("lib/min_heap: introduce non-inline versions of min heap API functions"), which introduced non-inlined versions of the min_heap API, adding function call overhead to a performance-critical hot path. This patch (of 3): This reverts commit `3d8a9a1c35`. Although removing the custom swap function simplified the code, this change is part of a broader migration to the generic min_heap API that introduced significant performance regressions in bcache. As reported by Robert, bcache now suffers from latency spikes, with P100 (max) latency increasing from 600 ms to 2.4 seconds every 5 minutes. These regressions degrade bcache's effectiveness as a low-latency cache layer and lead to frequent timeouts and application stalls in production environments. This revert is part of a series of changes to restore previous performance by undoing the min_heap transition. Link: https://lkml.kernel.org/r/20250614202353.1632957-1-visitorckw@gmail.com Link: https://lore.kernel.org/lkml/CAJhEC05+0S69z+3+FB2Cd0hD+pCRyWTKLEOsc8BOmH73p1m+KQ@mail.gmail.com Link: https://lkml.kernel.org/r/20250614202353.1632957-2-visitorckw@gmail.com Fixes: `866898efbb` ("bcache: remove heap-related macros and switch to generic min_heap") Fixes: `92a8b224b8` ("lib/min_heap: introduce non-inline versions of min heap API functions") Signed-off-by: Kuan-Wei Chiu <visitorckw@gmail.com> Reported-by: Robert Pang <robertpang@google.com> Closes: https://lore.kernel.org/linux-bcache/CAJhEC06F_AtrPgw2-7CvCqZgeStgCtitbD-ryuPpXQA-JG5XXw@mail.gmail.com Acked-by: Coly Li <colyli@kernel.org> Cc: Ching-Chun (Jim) Huang <jserv@ccns.ncku.edu.tw> Cc: Kent Overstreet <kent.overstreet@linux.dev> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2025-06-19 20:48:02 -07:00
Ingo Molnar	41cb08555c	treewide, timers: Rename from_timer() to timer_container_of() Move this API to the canonical timer_*() namespace. [ tglx: Redone against pre rc1 ] Signed-off-by: Ingo Molnar <mingo@kernel.org> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Link: https://lore.kernel.org/all/aB2X0jCKQO56WdMt@gmail.com	2025-06-08 09:07:37 +02:00
Linus Torvalds	6d8854216e	block-6.16-20250606 -----BEGIN PGP SIGNATURE----- iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmhC7/UQHGF4Ym9lQGtl cm5lbC5kawAKCRD301j7KXHgps+6D/9BOhkMyMkUF9LAev4PBNE+x3aftjl7Y1AY EHv2vozb4nDwXIaalG4qGUhprz2+z+hqxYjmnlOAsqbixhcSzKK5z9rjxyDka776 x03vfvKXaXZUG7XN7ENY8sJnLx4QJ0nh4+0gzT9yDyq2vKvPFLEKweNOxKDKCSbE 31vGoLFwjltp74hX+Qrnj1KMaTLgvAaV0eXKWlbX7Iiw6GFVm200zb27gth6U8bV WQAmjSkFQ0daHtAWmXIVy7hrXiCqe8D6YPKvBXnQ4cfKVbgG0HHDuTmQLpKGzfMi rr24MU5vZjt6OsYalceiTtifSUcf/I2+iFV7HswOk9kpOY5A2ylsWawRP2mm4PDI nJE3LaSTRpEvs5kzPJ2kr8Zp4/uvF6ehSq8Y9w52JekmOzxusLcRcswezaO00EI0 32uuK+P505EGTcCBTrEdtaI6k7zzQEeVoIpxqvMhNRG/s5vzvIV3eVrALu2HSDma P3paEdx7PwJla3ndmdChfh1vUR3TW3gWoZvoNCVmJzNCnLEAScTS2NsiQeEjy8zs 20IGsrRgIqt9KR8GZ2zj1ZOM47Cg0dIU3pbbA2Ja71wx4TYXJCSFFRK7mzDtXYlY BWOix/Dks8tk118cwuxnT+IiwmWDMbDZKnygh+4tiSyrs0IszeekRADLUu03C0Ve Dhpljqf3zA== =gs32 -----END PGP SIGNATURE----- Merge tag 'block-6.16-20250606' of git://git.kernel.dk/linux Pull more block updates from Jens Axboe: - NVMe pull request via Christoph: - TCP error handling fix (Shin'ichiro Kawasaki) - TCP I/O stall handling fixes (Hannes Reinecke) - fix command limits status code (Keith Busch) - support vectored buffers also for passthrough (Pavel Begunkov) - spelling fixes (Yi Zhang) - MD pull request via Yu: - fix REQ_RAHEAD and REQ_NOWAIT IO err handling for raid1/10 - fix max_write_behind setting for dm-raid - some minor cleanups - Integrity data direction fix and cleanup - bcache NULL pointer fix - Fix for loop missing write start/end handling - Decouple hardware queues and IO threads in ublk - Slew of ublk selftests additions and updates * tag 'block-6.16-20250606' of git://git.kernel.dk/linux: (29 commits) nvme: spelling fixes nvme-tcp: fix I/O stalls on congested sockets nvme-tcp: sanitize request list handling nvme-tcp: remove tag set when second admin queue config fails nvme: enable vectored registered bufs for passthrough cmds nvme: fix implicit bool to flags conversion nvme: fix command limits status code selftests: ublk: kublk: improve behavior on init failure block: flip iter directions in blk_rq_integrity_map_user() block: drop direction param from bio_integrity_copy_user() selftests: ublk: cover PER_IO_DAEMON in more stress tests Documentation: ublk: document UBLK_F_PER_IO_DAEMON selftests: ublk: add stress test for per io daemons selftests: ublk: add functional test for per io daemons selftests: ublk: kublk: decouple ublk_queues from ublk server threads selftests: ublk: kublk: move per-thread data out of ublk_queue selftests: ublk: kublk: lift queue initialization out of thread selftests: ublk: kublk: tie sqe allocation to io instead of queue selftests: ublk: kublk: plumb q_id in io_uring user_data ublk: have a per-io daemon instead of a per-queue daemon ...	2025-06-06 13:12:50 -07:00
Linus Torvalds	3c727285f1	- dm: better error handling when reloading a table - dm-delay: don't busy-wait in kthread - dm: use use generic disable_* functions instead of open coding them - dm: lock queue limits when reading them - dm-verity: use softirq context only when !need_resched() - dm-bufio: remove maximum age based eviction - dm: remove unneeded kvfree from alloc_targets - dm-flakey: various fixes - dm-mpath: interface for explicit probing of active paths - dm: fix BLK_FEAT_ATOMIC_WRITES - dm: pass through operations on wrapped inline crypto keys - dm vdo indexer: don't read request structure after enqueuing - dm-zone: Use bdev_() helper functions where applicable - dm-mpath: replace spin_lock_irqsave with spin_lock_irq - dm-mirror: fix a tiny race condition - dm-verity: fix a memory leak if some arguments are specified multiple times - dm-stripe: small code cleanup -----BEGIN PGP SIGNATURE----- iIoEABYIADIWIQRnH8MwLyZDhyYfesYTAyx9YGnhbQUCaD8vFBQcbXBhdG9ja2FA cmVkaGF0LmNvbQAKCRATAyx9YGnhbYGFAQC/V62PzDUa326WSdvwhtYe6jphInlW ZSmh37L4MIcV2wD/S8UqIaC9GSakee6jEWBTRiDqNZNEOIWhbd7f6gnTOQ0= =eYUe -----END PGP SIGNATURE----- Merge tag 'for-6.16/dm-changes' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm Pull device mapper updates from Mikulas Patocka: - better error handling when reloading a table - use use generic disable_ functions instead of open coding them - lock queue limits when reading them - remove unneeded kvfree from alloc_targets - fix BLK_FEAT_ATOMIC_WRITES - pass through operations on wrapped inline crypto keys - dm-verity: - use softirq context only when !need_resched() - fix a memory leak if some arguments are specified multiple times - dm-mpath: - interface for explicit probing of active paths - replace spin_lock_irqsave with spin_lock_irq - dm-delay: don't busy-wait in kthread - dm-bufio: remove maximum age based eviction - dm-flakey: various fixes - vdo indexer: don't read request structure after enqueuing - dm-zone: Use bdev_() helper functions where applicable - dm-mirror: fix a tiny race condition - dm-stripe: small code cleanup tag 'for-6.16/dm-changes' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm: (29 commits) dm-stripe: small code cleanup dm-verity: fix a memory leak if some arguments are specified multiple times dm-mirror: fix a tiny race condition dm-table: check BLK_FEAT_ATOMIC_WRITES inside limits_lock dm mpath: replace spin_lock_irqsave with spin_lock_irq dm-mpath: Don't grab work_mutex while probing paths dm-zone: Use bdev_*() helper functions where applicable dm vdo indexer: don't read request structure after enqueuing dm: pass through operations on wrapped inline crypto keys blk-crypto: export wrapped key functions dm-table: Set BLK_FEAT_ATOMIC_WRITES for target queue limits dm mpath: Interface for explicit probing of active paths dm: Allow .prepare_ioctl to handle ioctls directly dm-flakey: make corrupting read bios work dm-flakey: remove useless ERROR_READS check in flakey_end_io dm-flakey: error all IOs when num_features is absent dm-flakey: Clean up parsing messages dm: remove unneeded kvfree from alloc_targets dm-bufio: remove maximum age based eviction dm-verity: use softirq context only when !need_resched() ...	2025-06-03 15:54:46 -07:00
Mikulas Patocka	9f2f6316d7	dm-stripe: small code cleanup This commit doesn't fix any bug, it is just code cleanup. Use the function format_dev_t instead of sprintf, because format_dev_t does the same thing. Remove the useless memset call. An unsigned integer can take at most 10 digits, so extend the array size to 22. (note that because the range of minor and major numbers is limited, the size 16 could not be exceeded, thus this function couldn't write beyond string end) Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>	2025-06-03 19:06:32 +02:00
Mikulas Patocka	66be40a14e	dm-verity: fix a memory leak if some arguments are specified multiple times If some of the arguments "check_at_most_once", "ignore_zero_blocks", "use_fec_from_device", "root_hash_sig_key_desc" were specified more than once on the target line, a memory leak would happen. This commit fixes the memory leak. It also fixes error handling in verity_verify_sig_parse_opt_args. Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Cc: stable@vger.kernel.org	2025-06-03 19:01:42 +02:00
Mikulas Patocka	829451beae	dm-mirror: fix a tiny race condition There's a tiny race condition in dm-mirror. The functions queue_bio and write_callback grab a spinlock, add a bio to the list, drop the spinlock and wake up the mirrord thread that processes bios in the list. It may be possible that the mirrord thread processes the bio just after spin_unlock_irqrestore is called, before wakeup_mirrord. This spurious wake-up is normally harmless, however if the device mapper device is unloaded just after the bio was processed, it may be possible that wakeup_mirrord(ms) uses invalid "ms" pointer. Fix this bug by moving wakeup_mirrord inside the spinlock. Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Cc: stable@vger.kernel.org	2025-06-03 19:01:23 +02:00
Benjamin Marzinski	85f6d5b729	dm-table: check BLK_FEAT_ATOMIC_WRITES inside limits_lock dm_set_device_limits() should check q->limits.features for BLK_FEAT_ATOMIC_WRITES while holding q->limits_lock, like it does for the rest of the queue limits. Fixes: `b7c18b17a1` ("dm-table: Set BLK_FEAT_ATOMIC_WRITES for target queue limits") Signed-off-by: Benjamin Marzinski <bmarzins@redhat.com> Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>	2025-06-02 14:24:51 +02:00
Linus Torvalds	7d4e49a77d	- The 3 patch series "hung_task: extend blocking task stacktrace dump to semaphore" from Lance Yang enhances the hung task detector. The detector presently dumps the blocking tasks's stack when it is blocked on a mutex. Lance's series extends this to semaphores. - The 2 patch series "nilfs2: improve sanity checks in dirty state propagation" from Wentao Liang addresses a couple of minor flaws in nilfs2. - The 2 patch series "scripts/gdb: Fixes related to lx_per_cpu()" from Illia Ostapyshyn fixes a couple of issues in the gdb scripts. - The 9 patch series "Support kdump with LUKS encryption by reusing LUKS volume keys" from Coiby Xu addresses a usability problem with kdump. When the dump device is LUKS-encrypted, the kdump kernel may not have the keys to the encrypted filesystem. A full writeup of this is in the series [0/N] cover letter. - The 2 patch series "sysfs: add counters for lockups and stalls" from Max Kellermann adds /sys/kernel/hardlockup_count and /sys/kernel/hardlockup_count and /sys/kernel/rcu_stall_count. - The 3 patch series "fork: Page operation cleanups in the fork code" from Pasha Tatashin implements a number of code cleanups in fork.c. - The 3 patch series "scripts/gdb/symbols: determine KASLR offset on s390 during early boot" from Ilya Leoshkevich fixes some s390 issues in the gdb scripts. -----BEGIN PGP SIGNATURE----- iHUEABYKAB0WIQTTMBEPP41GrTpTJgfdBJ7gKXxAjgUCaDuCvQAKCRDdBJ7gKXxA jrkxAQCnFAp/uK9ckkbN4nfpJ0+OMY36C+A+dawSDtuRsIkXBAEAq3e6MNAUdg5W Ca0cXdgSIq1Op7ZKEA+66Km6Rfvfow8= =g45L -----END PGP SIGNATURE----- Merge tag 'mm-nonmm-stable-2025-05-31-15-28' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm Pull non-MM updates from Andrew Morton: - "hung_task: extend blocking task stacktrace dump to semaphore" from Lance Yang enhances the hung task detector. The detector presently dumps the blocking tasks's stack when it is blocked on a mutex. Lance's series extends this to semaphores - "nilfs2: improve sanity checks in dirty state propagation" from Wentao Liang addresses a couple of minor flaws in nilfs2 - "scripts/gdb: Fixes related to lx_per_cpu()" from Illia Ostapyshyn fixes a couple of issues in the gdb scripts - "Support kdump with LUKS encryption by reusing LUKS volume keys" from Coiby Xu addresses a usability problem with kdump. When the dump device is LUKS-encrypted, the kdump kernel may not have the keys to the encrypted filesystem. A full writeup of this is in the series [0/N] cover letter - "sysfs: add counters for lockups and stalls" from Max Kellermann adds /sys/kernel/hardlockup_count and /sys/kernel/hardlockup_count and /sys/kernel/rcu_stall_count - "fork: Page operation cleanups in the fork code" from Pasha Tatashin implements a number of code cleanups in fork.c - "scripts/gdb/symbols: determine KASLR offset on s390 during early boot" from Ilya Leoshkevich fixes some s390 issues in the gdb scripts * tag 'mm-nonmm-stable-2025-05-31-15-28' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (67 commits) llist: make llist_add_batch() a static inline delayacct: remove redundant code and adjust indentation squashfs: add optional full compressed block caching crash_dump, nvme: select CONFIGFS_FS as built-in scripts/gdb/symbols: determine KASLR offset on s390 during early boot scripts/gdb/symbols: factor out pagination_off() scripts/gdb/symbols: factor out get_vmlinux() kernel/panic.c: format kernel-doc comments mailmap: update and consolidate Casey Connolly's name and email nilfs2: remove wbc->for_reclaim handling fork: define a local GFP_VMAP_STACK fork: check charging success before zeroing stack fork: clean-up naming of vm_stack/vm_struct variables in vmap stacks code fork: clean-up ifdef logic around stack allocation kernel/rcu/tree_stall: add /sys/kernel/rcu_stall_count kernel/watchdog: add /sys/kernel/{hard,soft}lockup_count x86/crash: make the page that stores the dm crypt keys inaccessible x86/crash: pass dm crypt keys to kdump kernel Revert "x86/mm: Remove unused __set_memory_prot()" crash_dump: retrieve dm crypt keys in kdump kernel ...	2025-05-31 19:12:53 -07:00
Yu Kuai	01bf468c4e	md/md-bitmap: remove parameter slot from bitmap_create() All callers pass in '-1' for 'slot', hence it can be removed. Link: https://lore.kernel.org/linux-raid/20250524061320.370630-6-yukuai1@huaweicloud.com Signed-off-by: Yu Kuai <yukuai3@huawei.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Xiao Ni <xni@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Hannes Reinecke <hare@suse.de>	2025-05-30 15:47:23 +08:00
Yu Kuai	38f520a37d	md/md-bitmap: cleanup bitmap_ops->startwrite() bitmap_startwrite() always return 0, and the caller doesn't check return value as well, hence change the method to void. Also rename startwrite/endwrite to start_write/end_write, which is more in line with the usual naming convention. Link: https://lore.kernel.org/linux-raid/20250524061320.370630-4-yukuai1@huaweicloud.com Signed-off-by: Yu Kuai <yukuai3@huawei.com> Reviewed-by: Xiao Ni <xni@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Hannes Reinecke <hare@suse.de>	2025-05-30 15:47:23 +08:00
Yu Kuai	b886475804	md/dm-raid: remove max_write_behind setting limit The comments said 'vaule in kB', while the value actually means the number of write_behind IOs. And since md-bitmap will automatically adjust the value to max COUNTER_MAX / 2, there is no need to fail early. Also move some macros that is only used md-bitmap.c. Link: https://lore.kernel.org/linux-raid/20250524061320.370630-15-yukuai1@huaweicloud.com Signed-off-by: Yu Kuai <yukuai3@huawei.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Hannes Reinecke <hare@suse.de> Reviewed-by: Xiao Ni <xni@redhat.com>	2025-05-30 15:47:23 +08:00
Yu Kuai	2afe17794c	md/md-bitmap: fix dm-raid max_write_behind setting It's supposed to be COUNTER_MAX / 2, not COUNTER_MAX. Link: https://lore.kernel.org/linux-raid/20250524061320.370630-14-yukuai1@huaweicloud.com Signed-off-by: Yu Kuai <yukuai3@huawei.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Hannes Reinecke <hare@suse.de>	2025-05-30 15:47:23 +08:00
Yu Kuai	9f346f7d4e	md/raid1,raid10: don't handle IO error for REQ_RAHEAD and REQ_NOWAIT IO with REQ_RAHEAD or REQ_NOWAIT can fail early, even if the storage medium is fine, hence record badblocks or remove the disk from array does not make sense. This problem if found by lvm2 test lvcreate-large-raid, where dm-zero will fail read ahead IO directly. Fixes: `e879a0d9cb` ("md/raid1,raid10: don't ignore IO flags") Reported-and-tested-by: Mikulas Patocka <mpatocka@redhat.com> Closes: https://lore.kernel.org/all/34fa755d-62c8-4588-8ee1-33cb1249bdf2@redhat.com/ Link: https://lore.kernel.org/linux-raid/20250527081407.3004055-1-yukuai1@huaweicloud.com Signed-off-by: Yu Kuai <yukuai3@huawei.com>	2025-05-30 15:46:45 +08:00
Linus Torvalds	48cfc5791d	hardening updates for v6.16-rc1 - Update overflow helpers to ease refactoring of on-stack flex array instances (Gustavo A. R. Silva, Kees Cook) - lkdtm: Use SLAB_NO_MERGE instead of constructors (Harry Yoo) - Simplify CONFIG_CC_HAS_COUNTED_BY (Jan Hendrik Farr) - Disable u64 usercopy KUnit test on 32-bit SPARC (Thomas Weißschuh) - Add missed designated initializers now exposed by fixed randstruct (Nathan Chancellor, Kees Cook) - Document compilers versions for __builtin_dynamic_object_size - Remove ARM_SSP_PER_TASK GCC plugin - Fix GCC plugin randstruct, add selftests, and restore COMPILE_TEST builds - Kbuild: induce full rebuilds when dependencies change with GCC plugins, the Clang sanitizer .scl file, or the randstruct seed. - Kbuild: Switch from -Wvla to -Wvla-larger-than=1 - Correct several __nonstring uses for -Wunterminated-string-initialization -----BEGIN PGP SIGNATURE----- iHUEABYKAB0WIQRSPkdeREjth1dHnSE2KwveOeQkuwUCaDUq9gAKCRA2KwveOeQk u+ZCAQDhqpOE/yn5gfjyplIvaTtzj9CaW6g11AmPYrimJCuj3QD9G+0o35kzlXOw f0ZIj2U7LFNgbLos+20hQwhMFf1Zhgg= =OYzD -----END PGP SIGNATURE----- Merge tag 'hardening-v6.16-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux Pull hardening updates from Kees Cook: - Update overflow helpers to ease refactoring of on-stack flex array instances (Gustavo A. R. Silva, Kees Cook) - lkdtm: Use SLAB_NO_MERGE instead of constructors (Harry Yoo) - Simplify CONFIG_CC_HAS_COUNTED_BY (Jan Hendrik Farr) - Disable u64 usercopy KUnit test on 32-bit SPARC (Thomas Weißschuh) - Add missed designated initializers now exposed by fixed randstruct (Nathan Chancellor, Kees Cook) - Document compilers versions for __builtin_dynamic_object_size - Remove ARM_SSP_PER_TASK GCC plugin - Fix GCC plugin randstruct, add selftests, and restore COMPILE_TEST builds - Kbuild: induce full rebuilds when dependencies change with GCC plugins, the Clang sanitizer .scl file, or the randstruct seed. - Kbuild: Switch from -Wvla to -Wvla-larger-than=1 - Correct several __nonstring uses for -Wunterminated-string-initialization * tag 'hardening-v6.16-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux: (23 commits) Revert "hardening: Disable GCC randstruct for COMPILE_TEST" lib/tests: randstruct: Add deep function pointer layout test lib/tests: Add randstruct KUnit test randstruct: gcc-plugin: Remove bogus void member net: qede: Initialize qede_ll_ops with designated initializer scsi: qedf: Use designated initializer for struct qed_fcoe_cb_ops md/bcache: Mark __nonstring look-up table integer-wrap: Force full rebuild when .scl file changes randstruct: Force full rebuild when seed changes gcc-plugins: Force full rebuild when plugins change kbuild: Switch from -Wvla to -Wvla-larger-than=1 hardening: simplify CONFIG_CC_HAS_COUNTED_BY overflow: Fix direct struct member initialization in _DEFINE_FLEX() kunit/overflow: Add tests for STACK_FLEX_ARRAY_SIZE() helper overflow: Add STACK_FLEX_ARRAY_SIZE() helper input/joystick: magellan: Mark __nonstring look-up table const watchdog: exar: Shorten identity name to fit correctly mod_devicetable: Enlarge the maximum platform_device_id name length overflow: Clarify expectations for getting DEFINE_FLEX variable sizes compiler_types: Identify compiler versions for __builtin_dynamic_object_size ...	2025-05-28 07:47:10 -07:00
Mingzhe Zou	208c1559c5	bcache: reserve more RESERVE_BTREE buckets to prevent allocator hang Reported an IO hang and unrecoverable error in our testing environment. After careful research, we found that bch_allocator_thread is stuck, the call stack is as follows: [<0>] __switch_to+0xbc/0x108 [<0>] __closure_sync+0x7c/0xbc [bcache] [<0>] bch_prio_write+0x430/0x448 [bcache] [<0>] bch_allocator_thread+0xb44/0xb70 [bcache] [<0>] kthread+0x124/0x130 [<0>] ret_from_fork+0x10/0x18 Moreover, the RESERVE_BTREE type bucket slot are empty and journal_full occurs at the same time. When the cache disk is first used, the sb.nJournal_buckets defaults to 0. So, only 8 RESERVE_BTREE type buckets are reserved. If RESERVE_BTREE type buckets used up or btree_check_reserve() failed when request handle btree split, the request will be repeatedly retried and wait for alloc thread to fill in. After the alloc thread fills the buckets, it will call bch_prio_write(). If journal_full occurs simultaneously at this time, journal_reclaim() and btree_flush_write() will be called sequentially, journal_write cannot be completed. This is a low probability event, we believe that reserve more RESERVE_BTREE buckets can avoid the worst situation. Fixes: `682811b3ce` ("bcache: fix for allocator and register thread race") Signed-off-by: Mingzhe Zou <mingzhe.zou@easystack.cn> Signed-off-by: Coly Li <colyli@kernel.org> Link: https://lore.kernel.org/r/20250527051601.74407-4-colyli@kernel.org Signed-off-by: Jens Axboe <axboe@kernel.dk>	2025-05-27 07:38:19 -06:00
Robert Pang	5a08e49f23	bcache: remove unused constants Remove constants MAX_NEED_GC and MAX_SAVE_PRIO in btree.c that have been unused since initial commit. Signed-off-by: Robert Pang <robertpang@google.com> Signed-off-by: Coly Li <colyli@kernel.org> Link: https://lore.kernel.org/r/20250527051601.74407-3-colyli@kernel.org Signed-off-by: Jens Axboe <axboe@kernel.dk>	2025-05-27 07:38:19 -06:00
Linggang Zeng	1e46ed947e	bcache: fix NULL pointer in cache_set_flush() 1. LINE#1794 - LINE#1887 is some codes about function of bch_cache_set_alloc(). 2. LINE#2078 - LINE#2142 is some codes about function of register_cache_set(). 3. register_cache_set() will call bch_cache_set_alloc() in LINE#2098. 1794 struct cache_set bch_cache_set_alloc(struct cache_sb sb) 1795 { ... 1860 if (!(c->devices = kcalloc(c->nr_uuids, sizeof(void ), GFP_KERNEL)) \|\| 1861 mempool_init_slab_pool(&c->search, 32, bch_search_cache) \|\| 1862 mempool_init_kmalloc_pool(&c->bio_meta, 2, 1863 sizeof(struct bbio) + sizeof(struct bio_vec) 1864 bucket_pages(c)) \|\| 1865 mempool_init_kmalloc_pool(&c->fill_iter, 1, iter_size) \|\| 1866 bioset_init(&c->bio_split, 4, offsetof(struct bbio, bio), 1867 BIOSET_NEED_BVECS\|BIOSET_NEED_RESCUER) \|\| 1868 !(c->uuids = alloc_bucket_pages(GFP_KERNEL, c)) \|\| 1869 !(c->moving_gc_wq = alloc_workqueue("bcache_gc", 1870 WQ_MEM_RECLAIM, 0)) \|\| 1871 bch_journal_alloc(c) \|\| 1872 bch_btree_cache_alloc(c) \|\| 1873 bch_open_buckets_alloc(c) \|\| 1874 bch_bset_sort_state_init(&c->sort, ilog2(c->btree_pages))) 1875 goto err; ^^^^^^^^ 1876 ... 1883 return c; 1884 err: 1885 bch_cache_set_unregister(c); ^^^^^^^^^^^^^^^^^^^^^^^^^^^ 1886 return NULL; 1887 } ... 2078 static const char register_cache_set(struct cache ca) 2079 { ... 2098 c = bch_cache_set_alloc(&ca->sb); 2099 if (!c) 2100 return err; ^^^^^^^^^^ ... 2128 ca->set = c; 2129 ca->set->cache[ca->sb.nr_this_dev] = ca; ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ ... 2138 return NULL; 2139 err: 2140 bch_cache_set_unregister(c); 2141 return err; 2142 } (1) If LINE#1860 - LINE#1874 is true, then do 'goto err'(LINE#1875) and call bch_cache_set_unregister()(LINE#1885). (2) As (1) return NULL(LINE#1886), LINE#2098 - LINE#2100 would return. (3) As (2) has returned, LINE#2128 - LINE#2129 would do not give the value to c->cache[], it means that c->cache[] is NULL. LINE#1624 - LINE#1665 is some codes about function of cache_set_flush(). As (1), in LINE#1885 call bch_cache_set_unregister() ---> bch_cache_set_stop() ---> closure_queue() -.-> cache_set_flush() (as below LINE#1624) 1624 static void cache_set_flush(struct closure *cl) 1625 { ... 1654 for_each_cache(ca, c, i) 1655 if (ca->alloc_thread) ^^ 1656 kthread_stop(ca->alloc_thread); ... 1665 } (4) In LINE#1655 ca is NULL(see (3)) in cache_set_flush() then the kernel crash occurred as below: [ 846.712887] bcache: register_cache() error drbd6: cannot allocate memory [ 846.713242] bcache: register_bcache() error : failed to register device [ 846.713336] bcache: cache_set_free() Cache set 2f84bdc1-498a-4f2f-98a7-01946bf54287 unregistered [ 846.713768] BUG: unable to handle kernel NULL pointer dereference at 00000000000009f8 [ 846.714790] PGD 0 P4D 0 [ 846.715129] Oops: 0000 [#1] SMP PTI [ 846.715472] CPU: 19 PID: 5057 Comm: kworker/19:16 Kdump: loaded Tainted: G OE --------- - - 4.18.0-147.5.1.el8_1.5es.3.x86_64 #1 [ 846.716082] Hardware name: ESPAN GI-25212/X11DPL-i, BIOS 2.1 06/15/2018 [ 846.716451] Workqueue: events cache_set_flush [bcache] [ 846.716808] RIP: 0010:cache_set_flush+0xc9/0x1b0 [bcache] [ 846.717155] Code: 00 4c 89 a5 b0 03 00 00 48 8b 85 68 f6 ff ff a8 08 0f 84 88 00 00 00 31 db 66 83 bd 3c f7 ff ff 00 48 8b 85 48 ff ff ff 74 28 <48> 8b b8 f8 09 00 00 48 85 ff 74 05 e8 b6 58 a2 e1 0f b7 95 3c f7 [ 846.718026] RSP: 0018:ffffb56dcf85fe70 EFLAGS: 00010202 [ 846.718372] RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000000000 [ 846.718725] RDX: 0000000000000001 RSI: 0000000040000001 RDI: 0000000000000000 [ 846.719076] RBP: ffffa0ccc0f20df8 R08: ffffa0ce1fedb118 R09: 000073746e657665 [ 846.719428] R10: 8080808080808080 R11: 0000000000000000 R12: ffffa0ce1fee8700 [ 846.719779] R13: ffffa0ccc0f211a8 R14: ffffa0cd1b902840 R15: ffffa0ccc0f20e00 [ 846.720132] FS: 0000000000000000(0000) GS:ffffa0ce1fec0000(0000) knlGS:0000000000000000 [ 846.720726] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 846.721073] CR2: 00000000000009f8 CR3: 00000008ba00a005 CR4: 00000000007606e0 [ 846.721426] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 [ 846.721778] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 [ 846.722131] PKRU: 55555554 [ 846.722467] Call Trace: [ 846.722814] process_one_work+0x1a7/0x3b0 [ 846.723157] worker_thread+0x30/0x390 [ 846.723501] ? create_worker+0x1a0/0x1a0 [ 846.723844] kthread+0x112/0x130 [ 846.724184] ? kthread_flush_work_fn+0x10/0x10 [ 846.724535] ret_from_fork+0x35/0x40 Now, check whether that ca is NULL in LINE#1655 to fix the issue. Signed-off-by: Linggang Zeng <linggang.zeng@easystack.cn> Signed-off-by: Mingzhe Zou <mingzhe.zou@easystack.cn> Signed-off-by: Coly Li <colyli@kernel.org> Link: https://lore.kernel.org/r/20250527051601.74407-2-colyli@kernel.org Signed-off-by: Jens Axboe <axboe@kernel.dk>	2025-05-27 07:38:19 -06:00
Linus Torvalds	6f59de9bc0	for-6.16/block-20250523 -----BEGIN PGP SIGNATURE----- iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmgwnGYQHGF4Ym9lQGtl cm5lbC5kawAKCRD301j7KXHgpq9aD/4iqOts77xhWWLrOJWkkhOcV5rREeyppq8X MKYul9S4cc4Uin9Xou9a+nab31QBQEk3nsN3kX9o3yAXvkh6yUm36HD8qYNW/46q IUkwRQQJ0COyTnexMZQNTbZPQDIYcenXmQxOcrEJ5jC1Jcz0sOKHsgekL+ab3kCy fLnuz2ozvjGDMala/NmE8fN5qSlj4qQABHgbamwlwfo4aWu07cwfqn5G/FCYJgDO xUvsnTVclom2g4G+7eSSvGQI1QyAxl5QpviPnj/TEgfFBFnhbCSoBTEY6ecqhlfW 6u59MF/Uw8E+weiuGY4L87kDtBhjQs3UMSLxCuwH7MxXb25ff7qB4AIkcFD0kKFH 3V5NtwqlU7aQT0xOjGxaHhfPwjLD+FVss4ARmuHS09/Kn8egOW9yROPyetnuH84R Oz0Ctnt1IPLFjvGeg3+rt9fjjS9jWOXLITb9Q6nX9gnCt7orCwIYke8YCpmnJyhn i+fV4CWYIQBBRKxIT0E/GhJxZOmL0JKpomnbpP2dH8npemnsTCuvtfdrK9gfhH2X chBVqCPY8MNU5zKfzdEiavPqcm9392lMzOoOXW2pSC1eAKqnAQ86ZT3r7rLntqE8 75LxHcvaQIsnpyG+YuJVHvoiJ83TbqZNpyHwNaQTYhDmdYpp2d/wTtTQywX4DuXb Y6NDJw5+kQ== =1PNK -----END PGP SIGNATURE----- Merge tag 'for-6.16/block-20250523' of git://git.kernel.dk/linux Pull block updates from Jens Axboe: - ublk updates: - Add support for updating the size of a ublk instance - Zero-copy improvements - Auto-registering of buffers for zero-copy - Series simplifying and improving GET_DATA and request lookup - Series adding quiesce support - Lots of selftests additions - Various cleanups - NVMe updates via Christoph: - add per-node DMA pools and use them for PRP/SGL allocations (Caleb Sander Mateos, Keith Busch) - nvme-fcloop refcounting fixes (Daniel Wagner) - support delayed removal of the multipath node and optionally support the multipath node for private namespaces (Nilay Shroff) - support shared CQs in the PCI endpoint target code (Wilfred Mallawa) - support admin-queue only authentication (Hannes Reinecke) - use the crc32c library instead of the crypto API (Eric Biggers) - misc cleanups (Christoph Hellwig, Marcelo Moreira, Hannes Reinecke, Leon Romanovsky, Gustavo A. R. Silva) - MD updates via Yu: - Fix that normal IO can be starved by sync IO, found by mkfs on newly created large raid5, with some clean up patches for bdev inflight counters - Clean up brd, getting rid of atomic kmaps and bvec poking - Add loop driver specifically for zoned IO testing - Eliminate blk-rq-qos calls with a static key, if not enabled - Improve hctx locking for when a plug has IO for multiple queues pending - Remove block layer bouncing support, which in turn means we can remove the per-node bounce stat as well - Improve blk-throttle support - Improve delay support for blk-throttle - Improve brd discard support - Unify IO scheduler switching. This should also fix a bunch of lockdep warnings we've been seeing, after enabling lockdep support for queue freezing/unfreezeing - Add support for block write streams via FDP (flexible data placement) on NVMe - Add a bunch of block helpers, facilitating the removal of a bunch of duplicated boilerplate code - Remove obsolete BLK_MQ pci and virtio Kconfig options - Add atomic/untorn write support to blktrace - Various little cleanups and fixes * tag 'for-6.16/block-20250523' of git://git.kernel.dk/linux: (186 commits) selftests: ublk: add test for UBLK_F_QUIESCE ublk: add feature UBLK_F_QUIESCE selftests: ublk: add test case for UBLK_U_CMD_UPDATE_SIZE traceevent/block: Add REQ_ATOMIC flag to block trace events ublk: run auto buf unregisgering in same io_ring_ctx with registering io_uring: add helper io_uring_cmd_ctx_handle() ublk: remove io argument from ublk_auto_buf_reg_fallback() ublk: handle ublk_set_auto_buf_reg() failure correctly in ublk_fetch() selftests: ublk: add test for covering UBLK_AUTO_BUF_REG_FALLBACK selftests: ublk: support UBLK_F_AUTO_BUF_REG ublk: support UBLK_AUTO_BUF_REG_FALLBACK ublk: register buffer to local io_uring with provided buf index via UBLK_F_AUTO_BUF_REG ublk: prepare for supporting to register request buffer automatically ublk: convert to refcount_t selftests: ublk: make IO & device removal test more stressful nvme: rename nvme_mpath_shutdown_disk to nvme_mpath_remove_disk nvme: introduce multipath_always_on module param nvme-multipath: introduce delayed removal of the multipath head node nvme-pci: derive and better document max segments limits nvme-pci: use struct_size for allocation struct nvme_dev ...	2025-05-26 11:39:36 -07:00
Mikulas Patocka	050a3e71ce	dm mpath: replace spin_lock_irqsave with spin_lock_irq Replace spin_lock_irqsave/spin_unlock_irqrestore with spin_lock_irq/spin_unlock_irq at places where it is known that interrupts are enabled. Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Benjamin Marzinski <bmarzins@redhat.com>	2025-05-22 15:56:42 +02:00
Benjamin Marzinski	5c977f1023	dm-mpath: Don't grab work_mutex while probing paths Grabbing the work_mutex keeps probe_active_paths() from running at the same time as multipath_message(). The only messages that could interfere with probing the paths are "disable_group", "enable_group", and "switch_group". These messages could force multipath to pick a new pathgroup while probe_active_paths() was probing the current pathgroup. If the multipath device has a hardware handler, and it switches active pathgroups while there is outstanding IO to a path device, it's possible that IO to the path will fail, even if the path would be usable if it was in the active pathgroup. To avoid this, do not clear the current pathgroup for the *_group messages while probe_active_paths() is running. Instead set a flag, and probe_active_paths() will clear the current pathgroup when it finishes probing the paths. For this to work correctly, multipath needs to check current_pg before next_pg in choose_pgpath(), but before this patch next_pg was only ever set when current_pg was cleared, so this doesn't change the current behavior when paths aren't being probed. Even with this change, it is still possible to switch pathgroups while the probe is running, but only if all the paths have failed, and the probe function will skip them as well in this case. If multiple DM_MPATH_PROBE_PATHS requests are received at once, there is no point in repeatedly issuing test IOs. Instead, the later probes should wait for the current probe to complete. If current pathgroup is still the same as the one that was just checked, the other probes should skip probing and just check the number of valid paths. Finally, probing the paths should quit early if the multipath device is trying to suspend, instead of continuing to issue test IOs, delaying the suspend. While this patch will not change the behavior of existing multipath users which don't use the DM_MPATH_PROBE_PATHS ioctl, when that ioctl is used, the behavior of the "disable_group", "enable_group", and "switch_group" messages can change subtly. When these messages return, the next IO to the multipath device will no longer be guaranteed to choose a new pathgroup. Instead, choosing a new pathgroup could be delayed by an in-progress DM_MPATH_PROBE_PATHS ioctl. The userspace multipath tools make no assumptions about what will happen to IOs after sending these messages, so this change will not effect already released versions of them, even if the DM_MPATH_PROBE_PATHS ioctl is run alongside them. Signed-off-by: Benjamin Marzinski <bmarzins@redhat.com> Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>	2025-05-16 13:23:45 +02:00
Bart Van Assche	241b9b584d	dm-zone: Use bdev_*() helper functions where applicable Improve code readability by using bdev_is_zone_aligned() and bdev_offset_from_zone_start() where applicable. No functionality has been changed. This patch is a reworked version of a patch from Pankaj Raghav. See also https://lore.kernel.org/linux-block/20220923173618.6899-11-p.raghav@samsung.com/. Cc: Damien Le Moal <dlemoal@kernel.org> Cc: Pankaj Raghav <p.raghav@samsung.com> Signed-off-by: Bart Van Assche <bvanassche@acm.org> Reviewed-by: Damien Le Moal <dlemoal@kernel.org> Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>	2025-05-15 15:55:07 +02:00
Matthew Sakai	3da732687d	dm vdo indexer: don't read request structure after enqueuing The function get_volume_page_protected may place a request on a queue for another thread to process asynchronously. When this happens, the volume should not read the request from the original thread. This can not currently cause problems, due to the way request processing is handled, but it is not safe in general. Reviewed-by: Ken Raeburn <raeburn@redhat.com> Signed-off-by: Matthew Sakai <msakai@redhat.com> Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>	2025-05-15 15:54:47 +02:00
Fedor Pchelkin	f3def8270c	sort.h: hoist cmp_int() into generic header file Deduplicate the same functionality implemented in several places by moving the cmp_int() helper macro into linux/sort.h. The macro performs a three-way comparison of the arguments mostly useful in different sorting strategies and algorithms. Link: https://lkml.kernel.org/r/20250427201451.900730-1-pchelkin@ispras.ru Signed-off-by: Fedor Pchelkin <pchelkin@ispras.ru> Suggested-by: Darrick J. Wong <djwong@kernel.org> Acked-by: Kent Overstreet <kent.overstreet@linux.dev> Acked-by: Coly Li <colyli@kernel.org> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Carlos Maiolino <cem@kernel.org> Cc: Christian Brauner <brauner@kernel.org> Cc: Coly Li <colyli@kernel.org> Cc: Jan Kara <jack@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2025-05-11 17:54:12 -07:00
Yu Kuai	752d0464b7	md: clean up accounting for issued sync IO It's no longer used and can be removed, also remove the field 'gendisk->sync_io'. Link: https://lore.kernel.org/linux-raid/20250506124903.2540268-10-yukuai1@huaweicloud.com Signed-off-by: Yu Kuai <yukuai3@huawei.com> Reviewed-by: Xiao Ni <xni@redhat.com>	2025-05-10 16:14:22 +08:00
Yu Kuai	e5797ae703	md: fix is_mddev_idle() If sync_speed is above speed_min, then is_mddev_idle() will be called for each sync IO to check if the array is idle, and inflight sync_io will be limited if the array is not idle. However, while mkfs.ext4 for a large raid5 array while recovery is in progress, it's found that sync_speed is already above speed_min while lots of stripes are used for sync IO, causing long delay for mkfs.ext4. Root cause is the following checking from is_mddev_idle(): t1: submit sync IO: events1 = completed IO - issued sync IO t2: submit next sync IO: events2 = completed IO - issued sync IO if (events2 - events1 > 64) For consequence, the more sync IO issued, the less likely checking will pass. And when completed normal IO is more than issued sync IO, the condition will finally pass and is_mddev_idle() will return false, however, last_events will be updated hence is_mddev_idle() can only return false once in a while. Fix this problem by changing the checking as following: 1) mddev doesn't have normal IO completed; 2) mddev doesn't have normal IO inflight; 3) if any member disks is partition, and all other partitions doesn't have IO completed. Also change rdev->last_events to unsigned long to cleanup type casting. Link: https://lore.kernel.org/linux-raid/20250506124903.2540268-9-yukuai1@huaweicloud.com Signed-off-by: Yu Kuai <yukuai3@huawei.com> Reviewed-by: Xiao Ni <xni@redhat.com>	2025-05-10 16:13:31 +08:00
Yu Kuai	03720d82d7	md: add a new api sync_io_depth Currently if sync speed is above speed_min and below speed_max, md_do_sync() will wait for all sync IOs to be done before issuing new sync IO, means sync IO depth is limited to just 1. This limit is too low, in order to prevent sync speed drop conspicuously after fixing is_mddev_idle() in the next patch, add a new api for limiting sync IO depth, the default value is 32. Link: https://lore.kernel.org/linux-raid/20250506124903.2540268-8-yukuai1@huaweicloud.com Signed-off-by: Yu Kuai <yukuai3@huawei.com> Reviewed-by: Xiao Ni <xni@redhat.com>	2025-05-10 16:12:52 +08:00
Yu Kuai	7168be3c8a	md: record dm-raid gendisk in mddev Following patch will use gendisk to check if there are normal IO completed or inflight, to fix a problem in mdraid that foreground IO can be starved by background sync IO in later patches. Link: https://lore.kernel.org/linux-raid/20250506124903.2540268-7-yukuai1@huaweicloud.com Signed-off-by: Yu Kuai <yukuai3@huawei.com> Reviewed-by: Xiao Ni <xni@redhat.com>	2025-05-10 16:12:19 +08:00
Kees Cook	82d76bf938	md/bcache: Mark __nonstring look-up table GCC 15's new -Wunterminated-string-initialization notices that the 16 character lookup table "zero_uuid" (which is not used as a C-String) needs to be marked as "nonstring": drivers/md/bcache/super.c: In function 'uuid_find_empty': drivers/md/bcache/super.c:549:43: warning: initializer-string for array of 'char' truncates NUL terminator but destination lacks 'nonstring' attribute (17 chars into 16 available) [-Wunterminated-string-initialization] 549 \| static const char zero_uuid[16] = "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"; \| ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ Add the annotation (since it is not used as a C-String), and switch the initializer to an array of bytes rather than an empty initializer, as preferred by Coly Li. Suggested-by: Coly Li <colyli@kernel.org> Link: https://lore.kernel.org/lkml/389A9925-0990-422C-A1B3-0195FAA73288@coly.li/ Signed-off-by: Kees Cook <kees@kernel.org>	2025-05-08 09:42:06 -07:00
Christoph Hellwig	bd4e709b32	dm-integrity: use bio_add_virt_nofail Convert the __bio_add_page(..., virt_to_page(), ...) pattern to the bio_add_virt_nofail helper implementing it, and do the same for the similar pattern using bio_add_page for adding the first segment after a bio allocation as that can't fail either. Signed-off-by: Christoph Hellwig <hch@lst.de> Acked-by: Mikulas Patocka <mpatocka@redhat.com> Reviewed-by: Damien Le Moal <dlemoal@kernel.org> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Link: https://lore.kernel.org/r/20250507120451.4000627-15-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>	2025-05-07 07:31:07 -06:00
Christoph Hellwig	9134124ce1	dm-bufio: use bio_add_virt_nofail Convert the __bio_add_page(..., virt_to_page(), ...) pattern to the bio_add_virt_nofail helper implementing it. Signed-off-by: Christoph Hellwig <hch@lst.de> Acked-by: Mikulas Patocka <mpatocka@redhat.com> Reviewed-by: Damien Le Moal <dlemoal@kernel.org> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Link: https://lore.kernel.org/r/20250507120451.4000627-14-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>	2025-05-07 07:31:07 -06:00
Christoph Hellwig	23f5d69dfa	bcache: use bio_add_virt_nofail Convert the __bio_add_page(..., virt_to_page(), ...) pattern to the bio_add_virt_nofail helper implementing it. Signed-off-by: Christoph Hellwig <hch@lst.de> Acked-by: Coly Li <colyli@kernel.org> Reviewed-by: Damien Le Moal <dlemoal@kernel.org> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Link: https://lore.kernel.org/r/20250507120451.4000627-9-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>	2025-05-07 07:31:07 -06:00
Eric Biggers	e93912786e	dm: pass through operations on wrapped inline crypto keys Make the device-mapper layer pass through the derive_sw_secret, import_key, generate_key, and prepare_key blk-crypto operations when all underlying devices support hardware-wrapped inline crypto keys and are passing through inline crypto support. Commit `ebc4176551` ("blk-crypto: add basic hardware-wrapped key support") already made BLK_CRYPTO_KEY_TYPE_HW_WRAPPED be passed through in the same way that the other crypto capabilities are. But the wrapped key support also includes additional operations in blk_crypto_ll_ops, and the dm layer needs to implement those to pass them through. derive_sw_secret is needed by fscrypt, while the other operations are needed for the new blk-crypto ioctls to work on device-mapper devices and not just the raw partitions. Signed-off-by: Eric Biggers <ebiggers@google.com> Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>	2025-05-06 19:08:20 +02:00
Linus Torvalds	cccd033714	- dm: fix reading past the end of allocated memory - dm: fix missing dm_put_live_table() in dm_keyslot_evict() -----BEGIN PGP SIGNATURE----- iIoEABYIADIWIQRnH8MwLyZDhyYfesYTAyx9YGnhbQUCaBn8PhQcbXBhdG9ja2FA cmVkaGF0LmNvbQAKCRATAyx9YGnhbVCUAQDDMCRu68hiL5SWai9YXhw40rPTuC7k e/zHIwRsObItgAD/YvRH1d85XBhQY5x3PCHa3j1u9q+S3uF4naG1n1afqw8= =L5lI -----END PGP SIGNATURE----- Merge tag 'for-6.15/dm-fixes-2' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm Pull device mapper fixes from Mikulas Patocka: - fix reading past the end of allocated memory - fix missing dm_put_live_table() in dm_keyslot_evict() * tag 'for-6.15/dm-fixes-2' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm: dm: fix copying after src array boundaries dm: add missing unlock on in dm_keyslot_evict()	2025-05-06 08:14:20 -07:00
Tudor Ambarus	f1aff4bc19	dm: fix copying after src array boundaries The blammed commit copied to argv the size of the reallocated argv, instead of the size of the old_argv, thus reading and copying from past the old_argv allocated memory. Following BUG_ON was hit: [ 3.038929][ T1] kernel BUG at lib/string_helpers.c:1040! [ 3.039147][ T1] Internal error: Oops - BUG: 00000000f2000800 [#1] SMP ... [ 3.056489][ T1] Call trace: [ 3.056591][ T1] __fortify_panic+0x10/0x18 (P) [ 3.056773][ T1] dm_split_args+0x20c/0x210 [ 3.056942][ T1] dm_table_add_target+0x13c/0x360 [ 3.057132][ T1] table_load+0x110/0x3ac [ 3.057292][ T1] dm_ctl_ioctl+0x424/0x56c [ 3.057457][ T1] __arm64_sys_ioctl+0xa8/0xec [ 3.057634][ T1] invoke_syscall+0x58/0x10c [ 3.057804][ T1] el0_svc_common+0xa8/0xdc [ 3.057970][ T1] do_el0_svc+0x1c/0x28 [ 3.058123][ T1] el0_svc+0x50/0xac [ 3.058266][ T1] el0t_64_sync_handler+0x60/0xc4 [ 3.058452][ T1] el0t_64_sync+0x1b0/0x1b4 [ 3.058620][ T1] Code: f800865e a9bf7bfd 910003fd 941f48aa (d4210000) [ 3.058897][ T1] ---[ end trace 0000000000000000 ]--- [ 3.059083][ T1] Kernel panic - not syncing: Oops - BUG: Fatal exception Fix it by copying the size of src, and not the size of dst, as it was. Fixes: `5a2a6c4281` ("dm: always update the array size in realloc_argv on success") Cc: stable@vger.kernel.org Signed-off-by: Tudor Ambarus <tudor.ambarus@linaro.org> Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>	2025-05-06 14:06:59 +02:00
John Garry	b7c18b17a1	dm-table: Set BLK_FEAT_ATOMIC_WRITES for target queue limits Feature flag BLK_FEAT_ATOMIC_WRITES is not being properly set for the target queue limits, and this means that atomic writes are not being enabled for any dm personalities. When calling dm_set_device_limits() -> blk_stack_limits() -> ... -> blk_stack_atomic_writes_limits(), the bottom device limits (which corresponds to intermediate target queue limits) does not have BLK_FEAT_ATOMIC_WRITES set, and so atomic writes can never be enabled. Typically such a flag would be inherited from the stacked device in dm_set_device_limits() -> blk_stack_limits() via BLK_FEAT_INHERIT_MASK, but BLK_FEAT_ATOMIC_WRITES is not inherited as it's preferred to manually enable on a per-personality basis. Set BLK_FEAT_ATOMIC_WRITES manually for the intermediate target queue limits from the stacked device to get atomic writes working. Fixes: `3194e36488` ("dm-table: atomic writes support") Cc: stable@vger.kernel.org # v6.14 Signed-off-by: John Garry <john.g.garry@oracle.com> Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>	2025-05-04 12:02:39 +02:00
Kevin Wolf	7734fb4ad9	dm mpath: Interface for explicit probing of active paths Multipath cannot directly provide failover for ioctls in the kernel because it doesn't know what each ioctl means and which result could indicate a path error. Userspace generally knows what the ioctl it issued means and if it might be a path error, but neither does it know which path the ioctl took nor does it necessarily have the privileges to fail a path using the control device. In order to allow userspace to address this situation, implement a DM_MPATH_PROBE_PATHS ioctl that prompts the dm-mpath driver to probe all active paths in the current path group to see whether they still work, and fail them if not. If this returns success, userspace can retry the ioctl and expect that the previously hit bad path is now failed (or working again). The immediate motivation for this is the use of SG_IO in QEMU for SCSI passthrough. Following a failed SG_IO ioctl, QEMU will trigger probing to ensure that all active paths are actually alive, so that retrying SG_IO at least has a lower chance of failing due to a path error. However, the problem is broader than just SG_IO (it affects any ioctl), and if applications need failover support for other ioctls, the same probing can be used. This is not implemented on the DM control device, but on the DM mpath block devices, to allow all users who have access to such a block device to make use of this interface, specifically to implement failover for ioctls. For the same reason, it is also unprivileged. Its implementation is effectively just a bunch of reads, which could already be issued by userspace, just without any guarantee that all the rights paths are selected. The probing implemented here is done fully synchronously path by path; probing all paths concurrently is left as an improvement for the future. Co-developed-by: Hanna Czenczek <hreitz@redhat.com> Signed-off-by: Hanna Czenczek <hreitz@redhat.com> Signed-off-by: Kevin Wolf <kwolf@redhat.com> Reviewed-by: Benjamin Marzinski <bmarzins@redhat.com> Signed-off-by: Benjamin Marzinski <bmarzins@redhat.com> Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>	2025-05-04 11:35:06 +02:00
Kevin Wolf	4862c8861d	dm: Allow .prepare_ioctl to handle ioctls directly This adds a 'bool forward' parameter to .prepare_ioctl, which allows device mapper targets to accept ioctls to themselves instead of the underlying device. If the target already fully handled the ioctl, it sets forward to false and device mapper won't forward it to the underlying device any more. In order for targets to actually know what the ioctl is about and how to handle it, pass also cmd and arg. As long as targets restrict themselves to interpreting ioctls of type DM_IOCTL, this is a backwards compatible change because previously, any such ioctl would have been passed down through all device mapper layers until it reached a device that can't understand the ioctl and would return an error. Signed-off-by: Kevin Wolf <kwolf@redhat.com> Reviewed-by: Benjamin Marzinski <bmarzins@redhat.com> Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>	2025-05-04 11:35:05 +02:00
Benjamin Marzinski	13e79076c8	dm-flakey: make corrupting read bios work dm-flakey corrupts the read bios in the endio function. However, the corrupt_bio_* functions checked bio_has_data() to see if there was data to corrupt. Since this was the endio function, there was no data left to complete, so bio_has_data() was always false. Fix this by saving a copy of the bio's bi_iter in flakey_map(), and using this to initialize the iter for corrupting the read bios. This patch also skips cloning the bio for write bios with no data. Reported-by: Kent Overstreet <kent.overstreet@linux.dev> Fixes: `a3998799fb` ("dm flakey: add corrupt_bio_byte feature") Signed-off-by: Benjamin Marzinski <bmarzins@redhat.com> Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>	2025-05-04 11:35:05 +02:00
Benjamin Marzinski	4319f0aaa2	dm-flakey: remove useless ERROR_READS check in flakey_end_io If ERROR_READS is set, flakey_map returns DM_MAPIO_KILL for read bios and flakey_end_io is never called, so there's no point in checking it there. Also clean up an incorrect comment about when read IOs are errored out. Signed-off-by: Benjamin Marzinski <bmarzins@redhat.com> Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>	2025-05-04 11:35:05 +02:00
Benjamin Marzinski	40ed054f39	dm-flakey: error all IOs when num_features is absent dm-flakey would error all IOs if num_features was 0, but if it was absent, dm-flakey would never error any IO. Fix this so that no num_features works the same as num_features set to 0. Fixes: `aa7d7bc99f` ("dm flakey: add an "error_reads" option") Reported-by: Kent Overstreet <kent.overstreet@linux.dev> Signed-off-by: Benjamin Marzinski <bmarzins@redhat.com> Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>	2025-05-04 11:35:05 +02:00
Benjamin Marzinski	19da6b2c9e	dm-flakey: Clean up parsing messages There were a number of cases where the error message for an invalid table line did not match the actual problem. Fix these. Additionally, error out when duplicate corrupt_bio_byte, random_read_corrupt, or random_write_corrupt features are present. Also, error_reads is incompatible with random_read_corrupt and corrupt_bio_byte with the READ flag set, so disallow that. Reported-by: Kent Overstreet <kent.overstreet@linux.dev> Signed-off-by: Benjamin Marzinski <bmarzins@redhat.com> Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>	2025-05-04 11:35:05 +02:00
Benjamin Marzinski	d90e7a500c	dm: remove unneeded kvfree from alloc_targets alloc_targets() is always called with a newly initialized table where t->highs == NULL. Signed-off-by: Benjamin Marzinski <bmarzins@redhat.com> Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>	2025-05-04 11:35:05 +02:00
Eric Biggers	9769378133	dm-bufio: remove maximum age based eviction Every 30 seconds, dm-bufio evicts all buffers that were not accessed within the last max_age_seconds, except those pinned in memory via retain_bytes. By default max_age_seconds is 300 (i.e. 5 minutes), and retain_bytes is 262144 (i.e. 256 KiB) per dm-bufio client. This eviction algorithm is much too eager and is also redundant with the shinker based eviction. Testing on an Android phone shows that about 30 MB of dm-bufio buffers (from dm-verity Merkle tree blocks) are loaded at boot time, and then about 90% of them are suddenly thrown away 5 minutes after boot. This results in unnecessary Merkle tree I/O later. Meanwhile, if the system actually encounters memory pressure, testing also shows that the shrinker is effective at evicting the buffers. Other major Linux kernel caches, such as the page cache, do not enforce a maximum age, instead relying on the shrinker. For these reasons, Android is now setting max_age_seconds to 86400 (i.e. 1 day), which mostly disables it; see `cadad290a7`%5E%21/ That is a much better default, but really the maximum age based eviction should not exist at all. Let's remove it. Note that this also eliminates the need to run work every 30 seconds, which is beneficial too. Signed-off-by: Eric Biggers <ebiggers@google.com> Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>	2025-05-04 11:35:05 +02:00
Eric Biggers	f9ed31214e	dm-verity: use softirq context only when !need_resched() Further limit verification in softirq (a.k.a. BH) context to cases where rescheduling of the interrupted task is not pending. This helps prevent the CPU from spending too long in softirq context. Note that handle_softirqs() in kernel/softirq.c already stops running softirqs in this same case. However, that check is too coarse-grained, since many I/O requests can be processed in a single BLOCK_SOFTIRQ. Signed-off-by: Eric Biggers <ebiggers@google.com> Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>	2025-05-04 11:35:05 +02:00
Mikulas Patocka	abb4cf2f4c	dm: lock limits when reading them Lock queue limits when reading them, so that we don't read halfway modified values. Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Cc: stable@vger.kernel.org	2025-05-04 11:35:05 +02:00
Mikulas Patocka	f1e24048ed	dm: use generic functions instead of disable_discard and disable_write_zeroes A small code cleanup: use blk_queue_disable_discard and blk_queue_disable_write_zeroes instead of disable_discard and disable_write_zeroes. Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>	2025-05-04 11:35:05 +02:00
Benjamin Marzinski	33304b75df	dm-delay: don't busy-wait in kthread When using a kthread to delay the IOs, dm-delay would continuously loop, checking if IOs were ready to submit. It had a cond_resched() call in the loop, but might still loop hundreds of millions of times waiting for an IO that was scheduled to be submitted 10s of ms in the future. With the change to make dm-delay over zoned devices always use kthreads regardless of the length of the delay, this wasted work only gets worse. To solve this and still keep roughly the same precision for very short delays, dm-delay now calls fsleep() for 1/8th of the smallest non-zero delay it will place on IOs, or 1 ms, whichever is smaller. The reason that dm-delay doesn't just use the actual expiration time of the next delayed IO to calculated the sleep time is that delay_dtr() must wait for the kthread to finish before deleting the table. If a zoned device with a long delay queued an IO shortly before being suspended and removed, the IO would be flushed in delay_presuspend(), but the removing the device would still have to wait for the remainder of the long delay. This time is now capped at 1 ms. Signed-off-by: Benjamin Marzinski <bmarzins@redhat.com> Reviewed-by: Damien Le Moal <dlemoal@kernel.org> Tested-by: Damien Le Moal <dlemoal@kernel.org> Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>	2025-05-04 11:35:05 +02:00
Benjamin Marzinski	ad320ae276	dm: fix native zone append devices on top of emulated ones If a DM device that can pass down zone append commands is stacked on top of a device that emulates zone append commands, it will allocate zone append emulation resources, even though it doesn't use them. This is because the underlying device will have max_hw_zone_append_sectors set to 0 to request zone append emulation. When the DM device is stacked on top of it, it will inherit that max_hw_zone_append_sectors limit, despite being able to pass down zone append bios. Solve this by making sure max_hw_zone_append_sectors is non-zero for DM devices that do not need zone append emulation. Reviewed-by: Damien Le Moal <dlemoal@kernel.org> Tested-by: Damien Le Moal <dlemoal@kernel.org> Signed-off-by: Benjamin Marzinski <bmarzins@redhat.com> Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>	2025-05-04 11:35:05 +02:00
Benjamin Marzinski	121218bef4	dm: limit swapping tables for devices with zone write plugs dm_revalidate_zones() only allowed new or previously unzoned devices to call blk_revalidate_disk_zones(). If the device was already zoned, disk->nr_zones would always equal md->nr_zones, so dm_revalidate_zones() returned without doing any work. This would make the zoned settings for the device not match the new table. If the device had zone write plug resources, it could run into errors like bdev_zone_is_seq() reading invalid memory because disk->conv_zones_bitmap was the wrong size. If the device doesn't have any zone write plug resources, calling blk_revalidate_disk_zones() will always correctly update device. If blk_revalidate_disk_zones() fails, it can still overwrite or clear the current disk->nr_zones value. In this case, DM must restore the previous value of disk->nr_zones, so that the zoned settings will continue to match the previous value that it fell back to. If the device already has zone write plug resources, blk_revalidate_disk_zones() will not correctly update them, if it is called for arbitrary zoned device changes. Since there is not much need for this ability, the easiest solution is to disallow any table reloads that change the zoned settings, for devices that already have zone plug resources. Specifically, if a device already has zone plug resources allocated, it can only switch to another zoned table that also emulates zone append. Also, it cannot change the device size or the zone size. A device can switch to an error target. Fixes: `bb37d77239` ("dm: introduce zone append emulation") Reviewed-by: Damien Le Moal <dlemoal@kernel.org> Tested-by: Damien Le Moal <dlemoal@kernel.org> Signed-off-by: Benjamin Marzinski <bmarzins@redhat.com> Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>	2025-05-04 11:35:05 +02:00
Benjamin Marzinski	37f53a2c60	dm: fix dm_blk_report_zones If dm_get_live_table() returned NULL, dm_put_live_table() was never called. Also, it is possible that md->zone_revalidate_map will change while calling this function. Only read it once, so that we are always using the same value. Otherwise we might miss a call to dm_put_live_table(). Finally, while md->zone_revalidate_map is set and a process is calling blk_revalidate_disk_zones() to set up the zone append emulation resources, it is possible that another process, perhaps triggered by blkdev_report_zones_ioctl(), will call dm_blk_report_zones(). If blk_revalidate_disk_zones() fails, these resources can be freed while the other process is still using them, causing a use-after-free error. blk_revalidate_disk_zones() will only ever be called when initially setting up the zone append emulation resources, such as when setting up a zoned dm-crypt table for the first time. Further table swaps will not set md->zone_revalidate_map or call blk_revalidate_disk_zones(). However it must be called using the new table (referenced by md->zone_revalidate_map) and the new queue limits while the DM device is suspended. dm_blk_report_zones() needs some way to distinguish between a call from blk_revalidate_disk_zones(), which must be allowed to use md->zone_revalidate_map to access this not yet activated table, and all other calls to dm_blk_report_zones(), which should not be allowed while the device is suspended and cannot use md->zone_revalidate_map, since the zone resources might be freed by the process currently calling blk_revalidate_disk_zones(). Solve this by tracking the process that sets md->zone_revalidate_map in dm_revalidate_zones() and only allowing that process to make use of it in dm_blk_report_zones(). Fixes: `f211268ed1` ("dm: Use the block layer zone append emulation") Reviewed-by: Damien Le Moal <dlemoal@kernel.org> Tested-by: Damien Le Moal <dlemoal@kernel.org> Signed-off-by: Benjamin Marzinski <bmarzins@redhat.com> Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>	2025-05-04 11:34:56 +02:00
Dan Carpenter	650266ac4c	dm: add missing unlock on in dm_keyslot_evict() We need to call dm_put_live_table() even if dm_get_live_table() returns NULL. Fixes: `9355a9eb21` ("dm: support key eviction from keyslot managers of underlying devices") Cc: stable@vger.kernel.org # v5.12+ Signed-off-by: Dan Carpenter <dan.carpenter@linaro.org> Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>	2025-04-30 18:17:43 +02:00

1 2 3 4 5 ...

8638 Commits