linux

Commit Graph

Author	SHA1	Message	Date
Yu Kuai	0c984a283a	md: add a new callback pers->bitmap_sector() This callback will be used in raid5 to convert io ranges from array to bitmap. Signed-off-by: Yu Kuai <yukuai3@huawei.com> Reviewed-by: Xiao Ni <xni@redhat.com> Link: https://lore.kernel.org/r/20250109015145.158868-4-yukuai1@huaweicloud.com Signed-off-by: Song Liu <song@kernel.org>	2025-01-13 08:56:11 -08:00
Yu Kuai	4f0e7d0e03	md/md-bitmap: remove the last parameter for bimtap_ops->endwrite() For the case that IO failed for one rdev, the bit will be mark as NEEDED in following cases: 1) If badblocks is set and rdev is not faulty; 2) If rdev is faulty; Case 1) is useless because synchronize data to badblocks make no sense. Case 2) can be replaced with mddev->degraded. Also remove R1BIO_Degraded, R10BIO_Degraded and STRIPE_DEGRADED since case 2) no longer use them. Signed-off-by: Yu Kuai <yukuai3@huawei.com> Link: https://lore.kernel.org/r/20250109015145.158868-3-yukuai1@huaweicloud.com Signed-off-by: Song Liu <song@kernel.org>	2025-01-13 08:56:10 -08:00
Yu Kuai	08c50142a1	md/md-bitmap: factor behind write counters out from bitmap_{start/end}write() behind_write is only used in raid1, prepare to refactor bitmap_{start/end}write(), there are no functional changes. Signed-off-by: Yu Kuai <yukuai3@huawei.com> Reviewed-by: Xiao Ni <xni@redhat.com> Link: https://lore.kernel.org/r/20250109015145.158868-2-yukuai1@huaweicloud.com Signed-off-by: Song Liu <song@kernel.org>	2025-01-13 08:56:10 -08:00
David Reaver	4fa91616c0	md: Replace deprecated kmap_atomic() with kmap_local_page() kmap_atomic() is deprecated and should be replaced with kmap_local_page() [1][2]. kmap_local_page() is faster in kernels with HIGHMEM enabled, can take page faults, and allows preemption. According to [2], this is safe as long as the code between kmap_atomic() and kunmap_atomic() does not implicitly depend on disabling page faults or preemption. It appears to me that none of the call sites in this patch depend on disabling page faults or preemption; they are all mapping a page to simply extract some information from it or print some debug info. [1] https://lwn.net/Articles/836144/ [2] https://docs.kernel.org/mm/highmem.html#temporary-virtual-mappings Signed-off-by: David Reaver <me@davidreaver.com> Link: https://lore.kernel.org/r/20250108192131.46843-1-me@davidreaver.com Signed-off-by: Song Liu <song@kernel.org>	2025-01-13 07:36:29 -08:00
Yu Kuai	127186cfb1	md: reintroduce md-linear THe md-linear is removed by commit `849d18e27b` ("md: Remove deprecated CONFIG_MD_LINEAR") because it has been marked as deprecated for a long time. However, md-linear is used widely for underlying disks with different size, sadly we didn't know this until now, and it's true useful to create partitions and assemble multiple raid and then append one to the other. People have to use dm-linear in this case now, however, they will prefer to minimize the number of involved modules. Fixes: `849d18e27b` ("md: Remove deprecated CONFIG_MD_LINEAR") Cc: stable@vger.kernel.org Signed-off-by: Yu Kuai <yukuai3@huawei.com> Acked-by: Coly Li <colyli@kernel.org> Acked-by: Mike Snitzer <snitzer@kernel.org> Link: https://lore.kernel.org/r/20250102112841.1227111-1-yukuai1@huaweicloud.com Signed-off-by: Song Liu <song@kernel.org>	2025-01-13 07:36:29 -08:00
Linus Torvalds	0b7958fa05	- dm-array fixes - dm-verity forward error correction fixes - remove the flag DM_TARGET_PASSES_INTEGRITY from dm-ebs - dm-thin RCU list fix -----BEGIN PGP SIGNATURE----- iIoEABYIADIWIQRnH8MwLyZDhyYfesYTAyx9YGnhbQUCZ36OARQcbXBhdG9ja2FA cmVkaGF0LmNvbQAKCRATAyx9YGnhbQOrAP0c8fRH2lrrWr71G8SBPeAwUGAS3lh0 Rhm4Bf9hiwVqHAD+LG0Jq9IX8zo4Ou9xeHYI3NIBvQ2M7Kum7tGKcxvlXAE= =Qmf0 -----END PGP SIGNATURE----- Merge tag 'for-6.13/dm-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm Pull device mapper fixes from Mikulas Patocka: - dm-array fixes - dm-verity forward error correction fixes - remove the flag DM_TARGET_PASSES_INTEGRITY from dm-ebs - dm-thin RCU list fix * tag 'for-6.13/dm-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm: dm thin: make get_first_thin use rcu-safe list first function dm-ebs: don't set the flag DM_TARGET_PASSES_INTEGRITY dm-verity FEC: Avoid copying RS parity bytes twice. dm-verity FEC: Fix RS FEC repair for roots unaligned to block size (take 2) dm array: fix cursor index when skipping across block boundaries dm array: fix unreleased btree blocks on closing a faulty array cursor dm array: fix releasing a faulty array block twice in dm_array_cursor_end	2025-01-08 10:12:01 -08:00
Krister Johansen	80f130bfad	dm thin: make get_first_thin use rcu-safe list first function The documentation in rculist.h explains the absence of list_empty_rcu() and cautions programmers against relying on a list_empty() -> list_first() sequence in RCU safe code. This is because each of these functions performs its own READ_ONCE() of the list head. This can lead to a situation where the list_empty() sees a valid list entry, but the subsequent list_first() sees a different view of list head state after a modification. In the case of dm-thin, this author had a production box crash from a GP fault in the process_deferred_bios path. This function saw a valid list head in get_first_thin() but when it subsequently dereferenced that and turned it into a thin_c, it got the inside of the struct pool, since the list was now empty and referring to itself. The kernel on which this occurred printed both a warning about a refcount_t being saturated, and a UBSAN error for an out-of-bounds cpuid access in the queued spinlock, prior to the fault itself. When the resulting kdump was examined, it was possible to see another thread patiently waiting in thin_dtr's synchronize_rcu. The thin_dtr call managed to pull the thin_c out of the active thins list (and have it be the last entry in the active_thins list) at just the wrong moment which lead to this crash. Fortunately, the fix here is straight forward. Switch get_first_thin() function to use list_first_or_null_rcu() which performs just a single READ_ONCE() and returns NULL if the list is already empty. This was run against the devicemapper test suite's thin-provisioning suites for delete and suspend and no regressions were observed. Signed-off-by: Krister Johansen <kjlx@templeofstupid.com> Fixes: `b10ebd34cc` ("dm thin: fix rcu_read_lock being held in code that can sleep") Cc: stable@vger.kernel.org Acked-by: Ming-Hung Tsai <mtsai@redhat.com> Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>	2025-01-08 15:29:39 +01:00
Mikulas Patocka	47f33c27fc	dm-ebs: don't set the flag DM_TARGET_PASSES_INTEGRITY dm-ebs uses dm-bufio to process requests that are not aligned on logical sector size. dm-bufio doesn't support passing integrity data (and it is unclear how should it do it), so we shouldn't set the DM_TARGET_PASSES_INTEGRITY flag. Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Cc: stable@vger.kernel.org Fixes: `d3c7b35c20` ("dm: add emulated block size target")	2025-01-08 15:28:47 +01:00
Milan Broz	548c6edbed	dm-verity FEC: Avoid copying RS parity bytes twice. Caching RS parity bytes is already done in fec_decode_bufs() now, no need to use yet another buffer for conversion to uint16_t. This patch removes that double copy of RS parity bytes. Signed-off-by: Milan Broz <gmazyland@gmail.com> Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>	2025-01-03 17:08:49 +01:00
Milan Broz	6df90c02ba	dm-verity FEC: Fix RS FEC repair for roots unaligned to block size (take 2) This patch fixes an issue that was fixed in the commit `df7b59ba92` ("dm verity: fix FEC for RS roots unaligned to block size") but later broken again in the commit `8ca7cab82b` ("dm verity fec: fix misaligned RS roots IO") If the Reed-Solomon roots setting spans multiple blocks, the code does not use proper parity bytes and randomly fails to repair even trivial errors. This bug cannot happen if the sector size is multiple of RS roots setting (Android case with roots 2). The previous solution was to find a dm-bufio block size that is multiple of the device sector size and roots size. Unfortunately, the optimization in commit `8ca7cab82b` ("dm verity fec: fix misaligned RS roots IO") is incorrect and uses data block size for some roots (for example, it uses 4096 block size for roots = 20). This patch uses a different approach: - It always uses a configured data block size for dm-bufio to avoid possible misaligned IOs. - and it caches the processed parity bytes, so it can join it if it spans two blocks. As the RS calculation is called only if an error is detected and the process is computationally intensive, copying a few more bytes should not introduce performance issues. The issue was reported to cryptsetup with trivial reproducer https://gitlab.com/cryptsetup/cryptsetup/-/issues/923 Reproducer (with roots=20): # create verity device with RS FEC dd if=/dev/urandom of=data.img bs=4096 count=8 status=none veritysetup format data.img hash.img --fec-device=fec.img --fec-roots=20 \| \ awk '/^Root hash/{ print $3 }' >roothash # create an erasure that should always be repairable with this roots setting dd if=/dev/zero of=data.img conv=notrunc bs=1 count=4 seek=4 status=none # try to read it through dm-verity veritysetup open data.img test hash.img --fec-device=fec.img --fec-roots=20 $(cat roothash) dd if=/dev/mapper/test of=/dev/null bs=4096 status=noxfer Even now the log says it cannot repair it: : verity-fec: 7:1: FEC 0: failed to correct: -74 : device-mapper: verity: 7:1: data block 0 is corrupted ... With this fix, errors are properly repaired. : verity-fec: 7:1: FEC 0: corrected 4 errors Signed-off-by: Milan Broz <gmazyland@gmail.com> Fixes: `8ca7cab82b` ("dm verity fec: fix misaligned RS roots IO") Cc: stable@vger.kernel.org Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>	2025-01-03 17:08:25 +01:00
Christoph Hellwig	cc76ace465	block: remove BLK_MQ_F_SHOULD_MERGE BLK_MQ_F_SHOULD_MERGE is set for all tag_sets except those that purely process passthrough commands (bsg-lib, ufs tmf, various nvme admin queues) and thus don't even check the flag. Remove it to simplify the driver interface. Signed-off-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20241219060214.1928848-1-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>	2024-12-23 08:17:23 -07:00
John Garry	19206d3f5e	block: Delete bio_set_prio() Since commit `43b62ce3ff` ("block: move bio io prio to a new field"), macro bio_set_prio() does nothing but set bio->bi_ioprio. All other places just set bio->bi_ioprio directly, so replace bio_set_prio() remaining callsites with setting bio->bi_ioprio directly and delete that macro. Signed-off-by: John Garry <john.g.garry@oracle.com> Acked-by: Jack Wang <jinpu.wang@ionos.com> Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com> Link: https://lore.kernel.org/r/20241202111957.2311683-3-john.g.garry@oracle.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2024-12-23 08:17:23 -07:00
John Garry	5c292ac6e6	block: Delete bio_prio() Since commit `43b62ce3ff` ("block: move bio io prio to a new field"), macro bio_prio() does nothing but return the value in bio->bi_ioprio. Most other places just read bio->bi_ioprio directly, so replace bi_ioprio() callsites with reading bio->bi_ioprio directly and delete that macro. Signed-off-by: John Garry <john.g.garry@oracle.com> Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com> Link: https://lore.kernel.org/r/20241202111957.2311683-2-john.g.garry@oracle.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2024-12-23 08:17:22 -07:00
Ming-Hung Tsai	0bb1968da2	dm array: fix cursor index when skipping across block boundaries dm_array_cursor_skip() seeks to the target position by loading array blocks iteratively until the specified number of entries to skip is reached. When seeking across block boundaries, it uses dm_array_cursor_next() to step into the next block. dm_array_cursor_skip() must first move the cursor index to the end of the current block; otherwise, the cursor position could incorrectly remain in the same block, causing the actual number of skipped entries to be much smaller than expected. This bug affects cache resizing in v2 metadata and could lead to data loss if the fast device is shrunk during the first-time resume. For example: 1. create a cache metadata consists of 32768 blocks, with a dirty block assigned to the second bitmap block. cache_restore v1.0 is required. cat <<EOF >> cmeta.xml <superblock uuid="" block_size="64" nr_cache_blocks="32768" \ policy="smq" hint_width="4"> <mappings> <mapping cache_block="32767" origin_block="0" dirty="true"/> </mappings> </superblock> EOF dmsetup create cmeta --table "0 8192 linear /dev/sdc 0" cache_restore -i cmeta.xml -o /dev/mapper/cmeta --metadata-version=2 2. bring up the cache while attempt to discard all the blocks belonging to the second bitmap block (block# 32576 to 32767). The last command is expected to fail, but it actually succeeds. dmsetup create cdata --table "0 2084864 linear /dev/sdc 8192" dmsetup create corig --table "0 65536 linear /dev/sdc 2105344" dmsetup create cache --table "0 65536 cache /dev/mapper/cmeta \ /dev/mapper/cdata /dev/mapper/corig 64 2 metadata2 writeback smq \ 2 migration_threshold 0" In addition to the reproducer described above, this fix can be verified using the "array_cursor/skip" tests in dm-unit: dm-unit run /pdata/array_cursor/skip/ --kernel-dir <KERNEL_DIR> Signed-off-by: Ming-Hung Tsai <mtsai@redhat.com> Fixes: `9b696229aa` ("dm persistent data: add cursor skip functions to the cursor APIs") Reviewed-by: Joe Thornber <thornber@redhat.com> Signed-off-by: Mike Snitzer <snitzer@kernel.org>	2024-12-13 08:39:18 -05:00
Ming-Hung Tsai	626f128ee9	dm array: fix unreleased btree blocks on closing a faulty array cursor The cached block pointer in dm_array_cursor might be NULL if it reaches an unreadable array block, or the array is empty. Therefore, dm_array_cursor_end() should call dm_btree_cursor_end() unconditionally, to prevent leaving unreleased btree blocks. This fix can be verified using the "array_cursor/iterate/empty" test in dm-unit: dm-unit run /pdata/array_cursor/iterate/empty --kernel-dir <KERNEL_DIR> Signed-off-by: Ming-Hung Tsai <mtsai@redhat.com> Fixes: `fdd1315aa5` ("dm array: introduce cursor api") Reviewed-by: Joe Thornber <thornber@redhat.com> Signed-off-by: Mike Snitzer <snitzer@kernel.org>	2024-12-13 08:37:39 -05:00
Ming-Hung Tsai	f2893c0804	dm array: fix releasing a faulty array block twice in dm_array_cursor_end When dm_bm_read_lock() fails due to locking or checksum errors, it releases the faulty block implicitly while leaving an invalid output pointer behind. The caller of dm_bm_read_lock() should not operate on this invalid dm_block pointer, or it will lead to undefined result. For example, the dm_array_cursor incorrectly caches the invalid pointer on reading a faulty array block, causing a double release in dm_array_cursor_end(), then hitting the BUG_ON in dm-bufio cache_put(). Reproduce steps: 1. initialize a cache device dmsetup create cmeta --table "0 8192 linear /dev/sdc 0" dmsetup create cdata --table "0 65536 linear /dev/sdc 8192" dmsetup create corig --table "0 524288 linear /dev/sdc $262144" dd if=/dev/zero of=/dev/mapper/cmeta bs=4k count=1 dmsetup create cache --table "0 524288 cache /dev/mapper/cmeta \ /dev/mapper/cdata /dev/mapper/corig 128 2 metadata2 writethrough smq 0" 2. wipe the second array block offline dmsteup remove cache cmeta cdata corig mapping_root=$(dd if=/dev/sdc bs=1c count=8 skip=192 \ 2>/dev/null \| hexdump -e '1/8 "%u\n"') ablock=$(dd if=/dev/sdc bs=1c count=8 skip=$((4096*mapping_root+2056)) \ 2>/dev/null \| hexdump -e '1/8 "%u\n"') dd if=/dev/zero of=/dev/sdc bs=4k count=1 seek=$ablock 3. try reopen the cache device dmsetup create cmeta --table "0 8192 linear /dev/sdc 0" dmsetup create cdata --table "0 65536 linear /dev/sdc 8192" dmsetup create corig --table "0 524288 linear /dev/sdc $262144" dmsetup create cache --table "0 524288 cache /dev/mapper/cmeta \ /dev/mapper/cdata /dev/mapper/corig 128 2 metadata2 writethrough smq 0" Kernel logs: (snip) device-mapper: array: array_block_check failed: blocknr 0 != wanted 10 device-mapper: block manager: array validator check failed for block 10 device-mapper: array: get_ablock failed device-mapper: cache metadata: dm_array_cursor_next for mapping failed ------------[ cut here ]------------ kernel BUG at drivers/md/dm-bufio.c:638! Fix by setting the cached block pointer to NULL on errors. In addition to the reproducer described above, this fix can be verified using the "array_cursor/damaged" test in dm-unit: dm-unit run /pdata/array_cursor/damaged --kernel-dir <KERNEL_DIR> Signed-off-by: Ming-Hung Tsai <mtsai@redhat.com> Fixes: `fdd1315aa5` ("dm array: introduce cursor api") Reviewed-by: Joe Thornber <thornber@redhat.com> Signed-off-by: Mike Snitzer <snitzer@kernel.org>	2024-12-13 08:33:38 -05:00
Damien Le Moal	b76b840fd9	dm: Fix dm-zoned-reclaim zone write pointer alignment The zone reclaim processing of the dm-zoned device mapper uses blkdev_issue_zeroout() to align the write pointer of a zone being used for reclaiming another zone, to write the valid data blocks from the zone being reclaimed at the same position relative to the zone start in the reclaim target zone. The first call to blkdev_issue_zeroout() will try to use hardware offload using a REQ_OP_WRITE_ZEROES operation if the device reports a non-zero max_write_zeroes_sectors queue limit. If this operation fails because of the lack of hardware support, blkdev_issue_zeroout() falls back to using a regular write operation with the zero-page as buffer. Currently, such REQ_OP_WRITE_ZEROES failure is automatically handled by the block layer zone write plugging code which will execute a report zones operation to ensure that the write pointer of the target zone of the failed operation has not changed and to "rewind" the zone write pointer offset of the target zone as it was advanced when the write zero operation was submitted. So the REQ_OP_WRITE_ZEROES failure does not cause any issue and blkdev_issue_zeroout() works as expected. However, since the automatic recovery of zone write pointers by the zone write plugging code can potentially cause deadlocks with queue freeze operations, a different recovery must be implemented in preparation for the removal of zone write plugging report zones based recovery. Do this by introducing the new function blk_zone_issue_zeroout(). This function first calls blkdev_issue_zeroout() with the flag BLKDEV_ZERO_NOFALLBACK to intercept failures on the first execution which attempt to use the device hardware offload with the REQ_OP_WRITE_ZEROES operation. If this attempt fails, a report zone operation is issued to restore the zone write pointer offset of the target zone to the correct position and blkdev_issue_zeroout() is called again without the BLKDEV_ZERO_NOFALLBACK flag. The report zones operation performing this recovery is implemented using the helper function disk_zone_sync_wp_offset() which calls the gendisk report_zones file operation with the callback disk_report_zones_cb(). This callback updates the target write pointer offset of the target zone using the new function disk_zone_wplug_sync_wp_offset(). dmz_reclaim_align_wp() is modified to change its call to blkdev_issue_zeroout() to a call to blk_zone_issue_zeroout() without any other change needed as the two functions are functionnally equivalent. Fixes: `dd291d77cc` ("block: Introduce zone write plugging") Cc: stable@vger.kernel.org Signed-off-by: Damien Le Moal <dlemoal@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Acked-by: Mike Snitzer <snitzer@kernel.org> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Link: https://lore.kernel.org/r/20241209122357.47838-4-dlemoal@kernel.org Signed-off-by: Jens Axboe <axboe@kernel.dk>	2024-12-10 09:15:33 -07:00
Liequan Che	b2e382ae12	bcache: revert replacing IS_ERR_OR_NULL with IS_ERR again Commit `028ddcac47` ("bcache: Remove unnecessary NULL point check in node allocations") leads a NULL pointer deference in cache_set_flush(). 1721 if (!IS_ERR_OR_NULL(c->root)) 1722 list_add(&c->root->list, &c->btree_cache); >From the above code in cache_set_flush(), if previous registration code fails before allocating c->root, it is possible c->root is NULL as what it is initialized. __bch_btree_node_alloc() never returns NULL but c->root is possible to be NULL at above line 1721. This patch replaces IS_ERR() by IS_ERR_OR_NULL() to fix this. Fixes: `028ddcac47` ("bcache: Remove unnecessary NULL point check in node allocations") Signed-off-by: Liequan Che <cheliequan@inspur.com> Cc: stable@vger.kernel.org Cc: Zheng Wang <zyytlz.wz@163.com> Reviewed-by: Mingzhe Zou <mingzhe.zou@easystack.cn> Signed-off-by: Coly Li <colyli@suse.de> Link: https://lore.kernel.org/r/20241202115638.28957-1-colyli@suse.de Signed-off-by: Jens Axboe <axboe@kernel.dk>	2024-12-03 15:06:27 -07:00
Linus Torvalds	cfd47302ac	block-6.13-20242901 -----BEGIN PGP SIGNATURE----- iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmdJ6jwQHGF4Ym9lQGtl cm5lbC5kawAKCRD301j7KXHgpvgeEADaPL+qmsRyp070K1OI/yTA9Jhf6liEyJ31 0GPVar5Vt6ObH6/POObrJqbBtAo5asQanFvwyVyLztKYxPHU7sdQaRcD+vvj7q3+ EhmQZSKM7Spp77awWhRWbeQfUBvdhTeGHjQH/0e60eIrF9KtEL9sM9hVqc8hBD9F YtDNWPCk7Rz1PPNYlGEkQ2JmYmaxh3Gn29c/k1cSSo3InEQOFj6x+0Cgz6RjbTx3 9HfpLhVG3WV5MlZCCwp7KG36aJzlc0nq53x/sC9cg+F17RvL2EwNAOUfLl75/Kp/ t7PCQSd2ODciiDN9qZW71KGtVtlJ07W048Rk0nB+ogneC0uh4fuIYTidP9D7io7D bBMrhDuUpnlPzlOqg0aeedXePQL7TRfT3CTentol6xldqg14n7C4QTQFQMSJCgJf gr4YCTwl0RTknXo0A3ja16XwsUq5+2xsSoCTU25TY+wgKiAcc5lN9fhbvPRzbCQC u9EQ9I9IFAMqEdnE51sw0x16fLtN2w4/zOkvTF+gD/KooEjSn9lcfeNue7jt1O0/ gFvFJCdXK/2GgxwHihvsEVdcNeaS8JowNafKUsfOM2G0qWQbY+l2vl/b5PfwecWi 0knOaqNWlGMwrQ+z+fgsEeFG7X98ninC7tqVZpzoZ7j0x65anH+Jq4q1Egongj0H 90zclclxjg== =6cbB -----END PGP SIGNATURE----- Merge tag 'block-6.13-20242901' of git://git.kernel.dk/linux Pull more block updates from Jens Axboe: - NVMe pull request via Keith: - Use correct srcu list traversal (Breno) - Scatter-gather support for metadata (Keith) - Fabrics shutdown race condition fix (Nilay) - Persistent reservations updates (Guixin) - Add the required bits for MD atomic write support for raid0/1/10 - Correct return value for unknown opcode in ublk - Fix deadlock with zone revalidation - Fix for the io priority request vs bio cleanups - Use the correct unsigned int type for various limit helpers - Fix for a race in loop - Cleanup blk_rq_prep_clone() to prevent uninit-value warning and make it easier for actual humans to read - Fix potential UAF when iterating tags - A few fixes for bfq-iosched UAF issues - Fix for brd discard not decrementing the allocated page count - Various little fixes and cleanups * tag 'block-6.13-20242901' of git://git.kernel.dk/linux: (36 commits) brd: decrease the number of allocated pages which discarded block, bfq: fix bfqq uaf in bfq_limit_depth() block: Don't allow an atomic write be truncated in blkdev_write_iter() mq-deadline: don't call req_get_ioprio from the I/O completion handler block: Prevent potential deadlock in blk_revalidate_disk_zones() block: Remove extra part pointer NULLify in blk_rq_init() nvme: tuning pr code by using defined structs and macros nvme: introduce change ptpl and iekey definition block: return bool from get_disk_ro and bdev_read_only block: remove a duplicate definition for bdev_read_only block: return bool from blk_rq_aligned block: return unsigned int from blk_lim_dma_alignment_and_pad block: return unsigned int from queue_dma_alignment block: return unsigned int from bdev_io_opt block: req->bio is always set in the merge code block: don't bother checking the data direction for merges block: blk-mq: fix uninit-value in blk_rq_prep_clone and refactor Revert "block, bfq: merge bfq_release_process_ref() into bfq_put_cooperator()" md/raid10: Atomic write support md/raid1: Atomic write support ...	2024-11-30 15:47:29 -08:00
Linus Torvalds	7eef7e306d	- Dm: remove unused functions and variables - Dm-ioctl: rate-limit error messages in syslog - Dm-persistent-data: fix typo - Dm-vdo murmurhash: remove u64 alignment requirement - Dm-vdo: reset bi_ioprio to the default - Dm: add support for get_unique_id - Dm thin: Add missing destroy_work_on_stack() - Dm-bufio: use kmalloc to allocate power-of-two sized buffers -----BEGIN PGP SIGNATURE----- iIoEABYIADIWIQRnH8MwLyZDhyYfesYTAyx9YGnhbQUCZ0SJnBQcbXBhdG9ja2FA cmVkaGF0LmNvbQAKCRATAyx9YGnhbV8xAP0dcS6O29X8HzSI8gVIgT8KYCpg3Ms2 amPDlm/RDewdSQD/Z5+CHlyrlYqnKejAIs7cbZlfxD/0avcg/Kc0h4ijGws= =2dtc -----END PGP SIGNATURE----- Merge tag 'for-6.13/dm-changes' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm Pull device mapper updates from Mikulas Patocka: - remove unused functions and variables - rate-limit error messages in syslog - fix typo - remove u64 alignment requirement for murmurhash - reset bi_ioprio to the default for dm-vdo - add support for get_unique_id - Add missing destroy_work_on_stack() to dm-thin - use kmalloc to allocate power-of-two sized buffers in bufio * tag 'for-6.13/dm-changes' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm: dm-verity: remove the unused "data_start" variable dm-bufio: use kmalloc to allocate power-of-two sized buffers dm thin: Add missing destroy_work_on_stack() dm: add support for get_unique_id dm vdo: fix function doc comment formatting dm vdo int-map: remove unused parameters dm-vdo: reset bi_ioprio to the default value when the bio is reset dm-vdo murmurhash: remove u64 alignment requirement dm: Fix typo in error message dm ioctl: rate limit a couple of ioctl based error messages dm vdo: Remove unused uds_compute_index_size dm vdo: Remove unused functions dm: zoned: Remove unused functions dm: Remove unused dm_table_bio_based dm: Remove unused dm_set_md_type dm cache: Remove unused functions in bio-prison-v1 dm cache: Remove unused dm_cache_size dm cache: Remove unused dm_cache_dump dm cache: Remove unused btracker_nr_writebacks_queued	2024-11-25 18:54:00 -08:00
Linus Torvalds	f5f4745a7f	- The series "resource: A couple of cleanups" from Andy Shevchenko performs some cleanups in the resource management code. - The series "Improve the copy of task comm" from Yafang Shao addresses possible race-induced overflows in the management of task_struct.comm[]. - The series "Remove unnecessary header includes from {tools/}lib/list_sort.c" from Kuan-Wei Chiu adds some cleanups and a small fix to the list_sort library code and to its selftest. - The series "Enhance min heap API with non-inline functions and optimizations" also from Kuan-Wei Chiu optimizes and cleans up the min_heap library code. - The series "nilfs2: Finish folio conversion" from Ryusuke Konishi finishes off nilfs2's folioification. - The series "add detect count for hung tasks" from Lance Yang adds more userspace visibility into the hung-task detector's activity. - Apart from that, singelton patches in many places - please see the individual changelogs for details. -----BEGIN PGP SIGNATURE----- iHUEABYIAB0WIQTTMBEPP41GrTpTJgfdBJ7gKXxAjgUCZ0L6lQAKCRDdBJ7gKXxA jmEIAPwMSglNPKRIOgzOvHh8MUJW1Dy8iKJ2kWCO3f6QTUIM2AEA+PazZbUd/g2m Ii8igH0UBibIgva7MrCyJedDI1O23AA= =8BIU -----END PGP SIGNATURE----- Merge tag 'mm-nonmm-stable-2024-11-24-02-05' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm Pull non-MM updates from Andrew Morton: - The series "resource: A couple of cleanups" from Andy Shevchenko performs some cleanups in the resource management code - The series "Improve the copy of task comm" from Yafang Shao addresses possible race-induced overflows in the management of task_struct.comm[] - The series "Remove unnecessary header includes from {tools/}lib/list_sort.c" from Kuan-Wei Chiu adds some cleanups and a small fix to the list_sort library code and to its selftest - The series "Enhance min heap API with non-inline functions and optimizations" also from Kuan-Wei Chiu optimizes and cleans up the min_heap library code - The series "nilfs2: Finish folio conversion" from Ryusuke Konishi finishes off nilfs2's folioification - The series "add detect count for hung tasks" from Lance Yang adds more userspace visibility into the hung-task detector's activity - Apart from that, singelton patches in many places - please see the individual changelogs for details * tag 'mm-nonmm-stable-2024-11-24-02-05' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (71 commits) gdb: lx-symbols: do not error out on monolithic build kernel/reboot: replace sprintf() with sysfs_emit() lib: util_macros_kunit: add kunit test for util_macros.h util_macros.h: fix/rework find_closest() macros Improve consistency of '#error' directive messages ocfs2: fix uninitialized value in ocfs2_file_read_iter() hung_task: add docs for hung_task_detect_count hung_task: add detect count for hung tasks dma-buf: use atomic64_inc_return() in dma_buf_getfile() fs/proc/kcore.c: fix coccinelle reported ERROR instances resource: avoid unnecessary resource tree walking in __region_intersects() ocfs2: remove unused errmsg function and table ocfs2: cluster: fix a typo lib/scatterlist: use sg_phys() helper checkpatch: always parse orig_commit in fixes tag nilfs2: convert metadata aops from writepage to writepages nilfs2: convert nilfs_recovery_copy_block() to take a folio nilfs2: convert nilfs_page_count_clean_buffers() to take a folio nilfs2: remove nilfs_writepage nilfs2: convert checkpoint file to be folio-based ...	2024-11-25 16:09:48 -08:00
Mikulas Patocka	a573e404cb	dm-verity: remove the unused "data_start" variable Remove the unused "data_start" variable. It is always set to zero and the user can't override it. If the user needs to use some existing offset within a block device, it is possible to use the linear target. Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>	2024-11-20 11:44:44 +01:00
Mikulas Patocka	61a57254a9	dm-bufio: use kmalloc to allocate power-of-two sized buffers Vlastimil Babka said [1] that kmalloc will return a power-of-two-aligned buffer if it was called with a power-of-two size. So, we can use kmalloc instead of our own slab cache in dm-bufio. Note that the code for the slab cache was not removed because dm-bufio supports non-power-of-two buffer sizes. Link: https://lore.kernel.org/linux-mm/e7fca292-7c79-4f97-a90c-d68178d8ca59@suse.cz/ [1] Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>	2024-11-20 11:44:23 +01:00
Yuan Can	e74fa2447b	dm thin: Add missing destroy_work_on_stack() This commit add missed destroy_work_on_stack() operations for pw->worker in pool_work_wait(). Fixes: `e7a3e871d8` ("dm thin: cleanup noflush_work to use a proper completion") Cc: stable@vger.kernel.org Signed-off-by: Yuan Can <yuancan@huawei.com> Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>	2024-11-20 11:38:05 +01:00
Benjamin Coddington	d5f01ace54	dm: add support for get_unique_id This adds support to obtain a device's unique id through dm, similar to the existing ioctl and persistent resevation handling. We limit this to single-target devices. This enables knfsd to export pNFS SCSI luns that have been exported from multipath devices. Signed-off-by: Benjamin Coddington <bcodding@redhat.com> Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de>	2024-11-20 11:38:04 +01:00
Matthew Sakai	19ac19e02f	dm vdo: fix function doc comment formatting Signed-off-by: Matthew Sakai <msakai@redhat.com> Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>	2024-11-20 11:38:04 +01:00
Matthew Sakai	7e976b2b9d	dm vdo int-map: remove unused parameters Remove __always_unused parameters from static functions. Also fix minor formatting issues. Reported-by: kernel test robot <lkp@intel.com> Closes: https://lore.kernel.org/oe-kbuild-all/202407141607.M3E2XQ0Z-lkp@intel.com/ Closes: https://lore.kernel.org/oe-kbuild-all/202409101018.B75pIBKR-lkp@intel.com/ Closes: https://lore.kernel.org/oe-kbuild-all/202410011107.U2xbVLRA-lkp@intel.com/ Signed-off-by: Matthew Sakai <msakai@redhat.com> Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>	2024-11-20 11:38:04 +01:00
Susan LeGendre-McGhee	bd7e677c6b	dm-vdo: reset bi_ioprio to the default value when the bio is reset Signed-off-by: Susan LeGendre-McGhee <slegendr@redhat.com> Signed-off-by: Matthew Sakai <msakai@redhat.com> Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>	2024-11-20 11:38:04 +01:00
Susan LeGendre-McGhee	87d76d286c	dm-vdo murmurhash: remove u64 alignment requirement Signed-off-by: Susan LeGendre-McGhee <slegendr@redhat.com> Signed-off-by: Matthew Sakai <msakai@redhat.com> Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>	2024-11-20 11:38:04 +01:00
Ssuhung Yeh	2deb70d3e6	dm: Fix typo in error message Remove the redundant "i" at the beginning of the error message. This "i" came from commit `1c13188669` ("dm: prefer '"%s...", __func__'"), the "i" is accidentally left. Signed-off-by: Ssuhung Yeh <ssuhung@gmail.com> Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Fixes: `1c13188669` ("dm: prefer '"%s...", __func__'") Cc: stable@vger.kernel.org # v6.3+	2024-11-20 11:38:04 +01:00
Colin Ian King	51f0659f87	dm ioctl: rate limit a couple of ioctl based error messages It is possible to spam the kernel log with a misbehaving user process that is passing incorrect dm ioctls to /dev/mapper/control. Use a rate limit on these error messages to reduce the noise. These errors were hit when running the stress-ng's device test. Signed-off-by: Colin Ian King <colin.i.king@gmail.com> Acked-by: Mike Snitzer <snitzer@kernel.org> Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>	2024-11-20 11:38:04 +01:00
Dr. David Alan Gilbert	b0e6210e7e	dm vdo: Remove unused uds_compute_index_size uds_compute_index_size() has been unused since it was added in commit `b46d79bdb8` ("dm vdo: add deduplication index storage interface") Remove it. Signed-off-by: Dr. David Alan Gilbert <linux@treblig.org> Reviewed-by: Matthew Sakai <msakai@redhat.com> Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>	2024-11-20 11:38:04 +01:00
Dr. David Alan Gilbert	295815f679	dm vdo: Remove unused functions get_data_vio_pool_active_discards() get_data_vio_pool_discard_limit() get_data_vio_pool_maximum_discards() set_data_vio_pool_discard_limit() are all unused since commit `a9da0fb6d8` ("dm vdo: remove all sysfs interfaces") Remove them. Signed-off-by: Dr. David Alan Gilbert <linux@treblig.org> Reviewed-by: Matthew Sakai <msakai@redhat.com> Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>	2024-11-20 11:38:04 +01:00
Dr. David Alan Gilbert	3571fc2f9d	dm: zoned: Remove unused functions dmz_resume_metadata() is unused since it was added in commit `3b1a94c88b` ("dm zoned: drive-managed zoned block device target") dmz_zone_nr_blocks_shift is unused since it was added in commit `3682056013` ("dm zoned: move fields from struct dmz_dev to dmz_metadata") Remove them. Signed-off-by: Dr. David Alan Gilbert <linux@treblig.org> Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>	2024-11-20 11:38:04 +01:00
Dr. David Alan Gilbert	ad9266118c	dm: Remove unused dm_table_bio_based dm_table_bio_based() is unused since commit `29dec90a0f` ("dm: fix bio_set allocation") Remove it. Signed-off-by: Dr. David Alan Gilbert <linux@treblig.org> Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>	2024-11-20 11:38:04 +01:00
Dr. David Alan Gilbert	feb83afa4e	dm: Remove unused dm_set_md_type dm_set_md_type() has been unused since commit `ba30585936` ("dm: move setting md->type into dm_setup_md_queue") Remove it. Signed-off-by: Dr. David Alan Gilbert <linux@treblig.org> Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>	2024-11-20 11:38:04 +01:00
Dr. David Alan Gilbert	047b821ca3	dm cache: Remove unused functions in bio-prison-v1 dm_cache_size() and dm_cache_dump() are unused since commit `b29d4986d0` ("dm cache: significant rework to leverage dm-bio-prison-v2") Remove them. Signed-off-by: Dr. David Alan Gilbert <linux@treblig.org> Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>	2024-11-20 11:38:04 +01:00
Dr. David Alan Gilbert	0153b7965d	dm cache: Remove unused dm_cache_size dm_cache_size() has been unused since the original commit `c6b4fcbad0` ("dm: add cache target") Remove it. Signed-off-by: Dr. David Alan Gilbert <linux@treblig.org> Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>	2024-11-20 11:38:04 +01:00
Dr. David Alan Gilbert	253bacc057	dm cache: Remove unused dm_cache_dump dm_cache_dump() has been unused since the original commit `c6b4fcbad0` ("dm: add cache target") Remove it. Signed-off-by: Dr. David Alan Gilbert <linux@treblig.org> Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>	2024-11-20 11:38:04 +01:00
Dr. David Alan Gilbert	2133ebea6b	dm cache: Remove unused btracker_nr_writebacks_queued btracker_nr_writebacks_queued() has been unused since commit `2e63309507` ("dm cache policy smq: don't do any writebacks unless IDLE") Remove it. Signed-off-by: Dr. David Alan Gilbert <linux@treblig.org> Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>	2024-11-20 11:38:04 +01:00
John Garry	a1d9b4fd42	md/raid10: Atomic write support Set BLK_FEAT_ATOMIC_WRITES_STACKED to enable atomic writes. For an attempt to atomic write to a region which has bad blocks, error the write as we just cannot do this. It is unlikely to find devices which support atomic writes and bad blocks. Reviewed-by: Yu Kuai <yukuai3@huawei.com> Signed-off-by: John Garry <john.g.garry@oracle.com> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Link: https://lore.kernel.org/r/20241118105018.1870052-6-john.g.garry@oracle.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2024-11-19 10:30:02 -07:00
John Garry	f2a38abf5f	md/raid1: Atomic write support Set BLK_FEAT_ATOMIC_WRITES_STACKED to enable atomic writes. For an attempt to atomic write to a region which has bad blocks, error the write as we just cannot do this. It is unlikely to find devices which support atomic writes and bad blocks. Reviewed-by: Yu Kuai <yukuai3@huawei.com> Signed-off-by: John Garry <john.g.garry@oracle.com> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Link: https://lore.kernel.org/r/20241118105018.1870052-5-john.g.garry@oracle.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2024-11-19 10:30:02 -07:00
John Garry	fa6fec8281	md/raid0: Atomic write support Set BLK_FEAT_ATOMIC_WRITES_STACKED to enable atomic writes. All other stacked device request queue limits should automatically be set properly. With regards to atomic write max bytes limit, this will be set at hw_max_sectors and this is limited by the stripe width, which we want. Reviewed-by: Yu Kuai <yukuai3@huawei.com> Signed-off-by: John Garry <john.g.garry@oracle.com> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Link: https://lore.kernel.org/r/20241118105018.1870052-4-john.g.garry@oracle.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2024-11-19 10:30:02 -07:00
Linus Torvalds	77a0cfafa9	for-6.13/block-20241118 -----BEGIN PGP SIGNATURE----- iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmc7S40QHGF4Ym9lQGtl cm5lbC5kawAKCRD301j7KXHgpjHVD/43rDZ8ehs+IAAr6S0RemNX1SRG0mK2UOEb kMoNogS7StO/c4JYW3JuzCyLRn5ZsgeWV/muqxwDEWQrmTGrvi+V45KikrZPwm3k p0ump33qV9EU2jiR1MKZjtwK2P0CI7/DD3W8ww6IOvKbTT7RcqQcdHznvXArFBtc xCuQPpayFG7ZasC+N9VaBwtiUEVgU3Ek9AFT7UVZRWajjHPNalQwaooJWayO0rEG KdoW5yG0ryLrgCY2ACSvRLS+2s14EJtb8hgT08WKHTNgd5LxhSKxfsTapamua+7U FdVS6Ij0tEkgu2jpvgj7QKO0Uw10Cnep2gj7RHts/LVewvkliS6XcheOzqRS1jWU I2EI+UaGOZ11OUiw52VIveEVS5zV/NWhgy5BSP9LYEvXw0BUAHRDYGMem8o5G1V1 SWqjIM1UWvcQDlAnMF9FDVzojvjVUmYWvcAlFFztO8J0B7SavHR3NcfHwEf57reH rNoUbi/9c4/wjJJF33gejiR5pU+ewy/Mk75GrtX3xpEqlztfRbf9/FbPCMEAO1KR DF/b3lkUV9i2/BRW6a0SpZ5RDSmSYMnateel6TrPyVSRnpiSSFO8FrbynwUOa17b 6i49YDFWzzXOrR1YWDg6IEtTrcmBEmvi7F6aoDs020qUnL0hwLn1ZuoIxuiFEpor Z0iFF1B/nw== =PWTH -----END PGP SIGNATURE----- Merge tag 'for-6.13/block-20241118' of git://git.kernel.dk/linux Pull block updates from Jens Axboe: - NVMe updates via Keith: - Use uring_cmd helper (Pavel) - Host Memory Buffer allocation enhancements (Christoph) - Target persistent reservation support (Guixin) - Persistent reservation tracing (Guixen) - NVMe 2.1 specification support (Keith) - Rotational Meta Support (Matias, Wang, Keith) - Volatile cache detection enhancment (Guixen) - MD updates via Song: - Maintainers update - raid5 sync IO fix - Enhance handling of faulty and blocked devices - raid5-ppl atomic improvement - md-bitmap fix - Support for manually defining embedded partition tables - Zone append fixes and cleanups - Stop sending the queued requests in the plug list to the driver ->queue_rqs() handle in reverse order. - Zoned write plug cleanups - Cleanups disk stats tracking and add support for disk stats for passthrough IO - Add preparatory support for file system atomic writes - Add lockdep support for queue freezing. Already found a bunch of issues, and some fixes for that are in here. More will be coming. - Fix race between queue stopping/quiescing and IO queueing - ublk recovery improvements - Fix ublk mmap for 64k pages - Various fixes and cleanups * tag 'for-6.13/block-20241118' of git://git.kernel.dk/linux: (118 commits) MAINTAINERS: Update git tree for mdraid subsystem block: make struct rq_list available for !CONFIG_BLOCK block/genhd: use seq_put_decimal_ull for diskstats decimal values block: don't reorder requests in blk_mq_add_to_batch block: don't reorder requests in blk_add_rq_to_plug block: add a rq_list type block: remove rq_list_move virtio_blk: reverse request order in virtio_queue_rqs nvme-pci: reverse request order in nvme_queue_rqs btrfs: validate queue limits block: export blk_validate_limits nvmet: add tracing of reservation commands nvme: parse reservation commands's action and rtype to string nvmet: report ns's vwc not present md/raid5: Increase r5conf.cache_name size block: remove the ioprio field from struct request block: remove the write_hint field from struct request nvme: check ns's volatile write cache not present nvme: add rotational support nvme: use command set independent id ns if available ...	2024-11-18 16:50:08 -08:00
John Garry	ea90d27034	md/raid5: Increase r5conf.cache_name size For compiling with W=1, the following warning can be seen: drivers/md/raid5.c: In function ‘setup_conf’: drivers/md/raid5.c:2423:12: error: ‘%s’ directive output may be truncated writing up to 31 bytes into a region of size between 16 and 26 [-Werror=format-truncation=] "raid%d-%s", conf->level, mdname(conf->mddev)); ^~ drivers/md/raid5.c:2422:3: note: ‘snprintf’ output between 7 and 48 bytes into a destination of size 32 snprintf(conf->cache_name[0], namelen, ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ "raid%d-%s", conf->level, mdname(conf->mddev)); ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ cc1: all warnings being treated as errors Increase the array size to avoid this warning. Signed-off-by: John Garry <john.g.garry@oracle.com> Link: https://lore.kernel.org/r/20241112161019.4154616-2-john.g.garry@oracle.com Signed-off-by: Song Liu <song@kernel.org>	2024-11-12 15:11:09 -08:00
Christoph Hellwig	559218d43e	block: pre-calculate max_zone_append_sectors max_zone_append_sectors differs from all other queue limits in that the final value used is not stored in the queue_limits but needs to be obtained using queue_limits_max_zone_append_sectors helper. This not only adds (tiny) extra overhead to the I/O path, but also can be easily forgotten in file system code. Add a new max_hw_zone_append_sectors value to queue_limits which is set by the driver, and calculate max_zone_append_sectors from that and the other inputs in blk_validate_zoned_limits, similar to how max_sectors is calculated to fix this. Signed-off-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20241104073955.112324-3-hch@lst.de Reviewed-by: Damien Le Moal <dlemoal@kernel.org> Link: https://lore.kernel.org/r/20241108154657.845768-2-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>	2024-11-11 09:20:36 -07:00
Mikulas Patocka	346dbf1b13	dm-cache: fix warnings about duplicate slab caches The commit `4c39529663` adds a warning about duplicate cache names if CONFIG_DEBUG_VM is selected. These warnings are triggered by the dm-cache code. The dm-cache code allocates a slab cache for each device. This commit changes it to allocate just one slab cache in the module init function. Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Fixes: `4c39529663` ("slab: Warn on duplicate cache names when DEBUG_VM=y")	2024-11-11 17:04:39 +01:00
Mikulas Patocka	42964e4b5e	dm-bufio: fix warnings about duplicate slab caches The commit `4c39529663` adds a warning about duplicate cache names if CONFIG_DEBUG_VM is selected. These warnings are triggered by the dm-bufio code. The dm-bufio code allocates a slab cache with each client. It is not possible to preallocate the caches in the module init function because the size of auxiliary per-buffer data is not known at this point. So, this commit changes dm-bufio so that it appends a unique atomic value to the cache name, to avoid the warnings. Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Fixes: `4c39529663` ("slab: Warn on duplicate cache names when DEBUG_VM=y")	2024-11-11 17:04:18 +01:00
John Garry	4cf58d9529	md/raid10: Handle bio_split() errors Add proper bio_split() error handling. For any error, call raid_end_bio_io() and return. Except for discard, where we end the bio directly. Reviewed-by: Yu Kuai <yukuai3@huawei.com> Reviewed-by: Hannes Reinecke <hare@suse.de> Signed-off-by: John Garry <john.g.garry@oracle.com> Link: https://lore.kernel.org/r/20241111112150.3756529-7-john.g.garry@oracle.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2024-11-11 08:35:46 -07:00
John Garry	b1a7ad8b5c	md/raid1: Handle bio_split() errors Add proper bio_split() error handling. For any error, call raid_end_bio_io() and return. For the case of an in the write path, we need to undo the increment in the rdev pending count and NULLify the r1_bio->bios[] pointers. For read path failure, we need to undo rdev pending count increment from the earlier read_balance() call. Reviewed-by: Yu Kuai <yukuai3@huawei.com> Reviewed-by: Hannes Reinecke <hare@suse.de> Signed-off-by: John Garry <john.g.garry@oracle.com> Link: https://lore.kernel.org/r/20241111112150.3756529-6-john.g.garry@oracle.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2024-11-11 08:35:46 -07:00
John Garry	74538fdac3	md/raid0: Handle bio_split() errors Add proper bio_split() error handling. For any error, set bi_status, end the bio, and return. Reviewed-by: Yu Kuai <yukuai3@huawei.com> Reviewed-by: Hannes Reinecke <hare@suse.de> Signed-off-by: John Garry <john.g.garry@oracle.com> Link: https://lore.kernel.org/r/20241111112150.3756529-5-john.g.garry@oracle.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2024-11-11 08:35:46 -07:00
Xiao Ni	fa1944bbe6	md/raid5: Wait sync io to finish before changing group cnt One customer reports a bug: raid5 is hung when changing thread cnt while resync is running. The stripes are all in conf->handle_list and new threads can't handle them. Commit `b39f35ebe8` ("md: don't quiesce in mddev_suspend()") removes pers->quiesce from mddev_suspend/resume. Before this patch, mddev_suspend needs to wait for all ios including sync io to finish. Now it's used to only wait normal io. Fix this by calling raid5_quiesce from raid5_store_group_thread_cnt directly to wait all sync requests to finish before changing the group cnt. Fixes: `b39f35ebe8` ("md: don't quiesce in mddev_suspend()") Cc: stable@vger.kernel.org Signed-off-by: Xiao Ni <xni@redhat.com> Reviewed-by: Yu Kuai <yukuai3@huawei.com> Link: https://lore.kernel.org/r/20241106095124.74577-1-xni@redhat.com Signed-off-by: Song Liu <song@kernel.org>	2024-11-07 15:34:52 -08:00
Jens Axboe	ab9bc81c1c	Revert "block: pre-calculate max_zone_append_sectors" This causes issue on, at least, nvme-mpath where my boot fails with: WARNING: CPU: 354 PID: 2729 at block/blk-settings.c:75 blk_validate_limits+0x356/0x380 Modules linked in: tg3(+) nvme usbcore scsi_mod ptp i2c_piix4 libphy nvme_core crc32c_intel scsi_common usb_common pps_core i2c_smbus CPU: 354 UID: 0 PID: 2729 Comm: kworker/u2061:1 Not tainted 6.12.0-rc6+ #181 Hardware name: Dell Inc. PowerEdge R7625/06444F, BIOS 1.8.3 04/02/2024 Workqueue: async async_run_entry_fn RIP: 0010:blk_validate_limits+0x356/0x380 Code: f6 47 01 04 75 28 83 bf 94 00 00 00 00 75 39 83 bf 98 00 00 00 00 75 34 83 7f 68 00 75 32 31 c0 83 7f 5c 00 0f 84 9b fd ff ff <0f> 0b eb 13 0f 0b eb 0f 48 c7 c0 74 12 58 92 48 89 c7 e8 13 76 46 RSP: 0018:ffffa8a1dfb93b30 EFLAGS: 00010286 RAX: 0000000000000000 RBX: ffff9232829c8388 RCX: 0000000000000088 RDX: 0000000000000080 RSI: 0000000000000200 RDI: ffffa8a1dfb93c38 RBP: 000000000000000c R08: 00000000ffffffff R09: 000000000000ffff R10: 0000000000000000 R11: 0000000000000000 R12: ffff9232829b9000 R13: ffff9232829b9010 R14: ffffa8a1dfb93c38 R15: ffffa8a1dfb93c38 FS: 0000000000000000(0000) GS:ffff923867c80000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 000055c1b92480a8 CR3: 0000002484ff0002 CR4: 0000000000370ef0 Call Trace: <TASK> ? __warn+0xca/0x1a0 ? blk_validate_limits+0x356/0x380 ? report_bug+0x11a/0x1a0 ? handle_bug+0x5e/0x90 ? exc_invalid_op+0x16/0x40 ? asm_exc_invalid_op+0x16/0x20 ? blk_validate_limits+0x356/0x380 blk_alloc_queue+0x7a/0x250 __blk_alloc_disk+0x39/0x80 nvme_mpath_alloc_disk+0x13d/0x1b0 [nvme_core] nvme_scan_ns+0xcc7/0x1010 [nvme_core] async_run_entry_fn+0x27/0x120 process_scheduled_works+0x1a0/0x360 worker_thread+0x2bc/0x350 ? pr_cont_work+0x1b0/0x1b0 kthread+0x111/0x120 ? kthread_unuse_mm+0x90/0x90 ret_from_fork+0x30/0x40 ? kthread_unuse_mm+0x90/0x90 ret_from_fork_asm+0x11/0x20 </TASK> ---[ end trace 0000000000000000 ]--- presumably due to max_zone_append_sectors not being cleared to zero, resulting in blk_validate_zoned_limits() complaining and failing. This reverts commit `2a8f6153e1`. Signed-off-by: Jens Axboe <axboe@kernel.dk>	2024-11-07 05:45:34 -07:00
Linus Torvalds	9e23acf024	- fix memory safety bugs in dm-cache - fix restart/panic logic in dm-verity - fix 32-bit unsigned integer overflow in dm-unstriped - fix a device mapper crash if blk_alloc_disk fails -----BEGIN PGP SIGNATURE----- iIoEABYIADIWIQRnH8MwLyZDhyYfesYTAyx9YGnhbQUCZyj5MhQcbXBhdG9ja2FA cmVkaGF0LmNvbQAKCRATAyx9YGnhbSpeAQCcyhjrFxvFQuTJm/nv65Txwqw3+nvu i45pJ1DbK1awEQD/W0xUhhWrHfXwnb2dHV/mowLSnlou7uUh/JUx9q24OQg= =5qjV -----END PGP SIGNATURE----- Merge tag 'for-6.12/dm-fixes-2' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm Pull device mapper fixes from Mikulas Patocka: - fix memory safety bugs in dm-cache - fix restart/panic logic in dm-verity - fix 32-bit unsigned integer overflow in dm-unstriped - fix a device mapper crash if blk_alloc_disk fails * tag 'for-6.12/dm-fixes-2' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm: dm cache: fix potential out-of-bounds access on the first resume dm cache: optimize dirty bit checking with find_next_bit when resizing dm cache: fix out-of-bounds access to the dirty bitset when resizing dm cache: fix flushing uninitialized delayed_work on cache_ctr error dm cache: correct the number of origin blocks to match the target length dm-verity: don't crash if panic_on_corruption is not selected dm-unstriped: cast an operand to sector_t to prevent potential uint32_t overflow dm: fix a crash if blk_alloc_disk fails	2024-11-06 07:56:47 -10:00
Yuan Can	6012169e8a	md/md-bitmap: Add missing destroy_work_on_stack() This commit add missed destroy_work_on_stack() operations for unplug_work.work in bitmap_unplug_async(). Fixes: `a022325ab9` ("md/md-bitmap: add a new helper to unplug bitmap asynchrously") Cc: stable@vger.kernel.org Signed-off-by: Yuan Can <yuancan@huawei.com> Reviewed-by: Yu Kuai <yukuai3@huawei.com> Link: https://lore.kernel.org/r/20241105130105.127336-1-yuancan@huawei.com Signed-off-by: Song Liu <song@kernel.org>	2024-11-05 21:06:51 -08:00
Kuan-Wei Chiu	3d8a9a1c35	bcache: update min_heap_callbacks to use default builtin swap Replace the swp function pointer in the min_heap_callbacks of bcache with NULL, allowing direct usage of the default builtin swap implementation. This modification simplifies the code and improves performance by removing unnecessary function indirection. Link: https://lkml.kernel.org/r/20241020040200.939973-8-visitorckw@gmail.com Signed-off-by: Kuan-Wei Chiu <visitorckw@gmail.com> Cc: Adrian Hunter <adrian.hunter@intel.com> Cc: Arnaldo Carvalho de Melo <acme@kernel.org> Cc: Ching-Chun (Jim) Huang <jserv@ccns.ncku.edu.tw> Cc: Coly Li <colyli@suse.de> Cc: Ian Rogers <irogers@google.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Jiri Olsa <jolsa@kernel.org> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Kent Overstreet <kent.overstreet@linux.dev> Cc: "Liang, Kan" <kan.liang@linux.intel.com> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Matthew Sakai <msakai@redhat.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Namhyung Kim <namhyung@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2024-11-05 17:12:36 -08:00
Kuan-Wei Chiu	d684430207	dm vdo: update min_heap_callbacks to use default builtin swap Replace the swp function pointer in the min_heap_callbacks of dm-vdo with NULL, allowing direct usage of the default builtin swap implementation. This modification simplifies the code and improves performance by removing unnecessary function indirection. Link: https://lkml.kernel.org/r/20241020040200.939973-7-visitorckw@gmail.com Signed-off-by: Kuan-Wei Chiu <visitorckw@gmail.com> Cc: Adrian Hunter <adrian.hunter@intel.com> Cc: Arnaldo Carvalho de Melo <acme@kernel.org> Cc: Ching-Chun (Jim) Huang <jserv@ccns.ncku.edu.tw> Cc: Coly Li <colyli@suse.de> Cc: Ian Rogers <irogers@google.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Jiri Olsa <jolsa@kernel.org> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Kent Overstreet <kent.overstreet@linux.dev> Cc: "Liang, Kan" <kan.liang@linux.intel.com> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Matthew Sakai <msakai@redhat.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Namhyung Kim <namhyung@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2024-11-05 17:12:36 -08:00
Kuan-Wei Chiu	92a8b224b8	lib/min_heap: introduce non-inline versions of min heap API functions Patch series "Enhance min heap API with non-inline functions and optimizations", v2. Add non-inline versions of the min heap API functions in lib/min_heap.c and updates all users outside of kernel/events/core.c to use these non-inline versions. To mitigate the performance impact of indirect function calls caused by the non-inline versions of the swap and compare functions, a builtin swap has been introduced that swaps elements based on their size. Additionally, it micro-optimizes the efficiency of the min heap by pre-scaling the counter, following the same approach as in lib/sort.c. Documentation for the min heap API has also been added to the core-api section. This patch (of 10): All current min heap API functions are marked with '__always_inline'. However, as the number of users increases, inlining these functions everywhere leads to a increase in kernel size. In performance-critical paths, such as when perf events are enabled and min heap functions are called on every context switch, it is important to retain the inline versions for optimal performance. To balance this, the original inline functions are kept, and additional non-inline versions of the functions have been added in lib/min_heap.c. Link: https://lkml.kernel.org/r/20241020040200.939973-1-visitorckw@gmail.com Link: https://lore.kernel.org/20240522161048.8d8bbc7b153b4ecd92c50666@linux-foundation.org Link: https://lkml.kernel.org/r/20241020040200.939973-2-visitorckw@gmail.com Signed-off-by: Kuan-Wei Chiu <visitorckw@gmail.com> Suggested-by: Andrew Morton <akpm@linux-foundation.org> Cc: Adrian Hunter <adrian.hunter@intel.com> Cc: Arnaldo Carvalho de Melo <acme@kernel.org> Cc: Ching-Chun (Jim) Huang <jserv@ccns.ncku.edu.tw> Cc: Coly Li <colyli@suse.de> Cc: Ian Rogers <irogers@google.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Jiri Olsa <jolsa@kernel.org> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Kent Overstreet <kent.overstreet@linux.dev> Cc: Kuan-Wei Chiu <visitorckw@gmail.com> Cc: "Liang, Kan" <kan.liang@linux.intel.com> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Matthew Sakai <msakai@redhat.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Namhyung Kim <namhyung@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2024-11-05 17:12:34 -08:00
Yu Kuai	649bfec690	md/raid5: don't set Faulty rdev for blocked_rdev Faulty rdev should never be accessed anymore, hence there is no point to wait for bad block to be acknowledged in this case while handling write request. Signed-off-by: Yu Kuai <yukuai3@huawei.com> Tested-by: Mariusz Tkaczyk <mariusz.tkaczyk@linux.intel.com> Link: https://lore.kernel.org/r/20241031033114.3845582-8-yukuai1@huaweicloud.com Signed-off-by: Song Liu <song@kernel.org>	2024-11-05 16:08:39 -08:00
Yu Kuai	d419284c95	md/raid10: don't wait for Faulty rdev in wait_blocked_rdev() Faulty rdev should never be accessed anymore, hence there is no point to wait for bad block to be acknowledged in this case while handling write request. Signed-off-by: Yu Kuai <yukuai3@huawei.com> Tested-by: Mariusz Tkaczyk <mariusz.tkaczyk@linux.intel.com> Link: https://lore.kernel.org/r/20241031033114.3845582-7-yukuai1@huaweicloud.com Signed-off-by: Song Liu <song@kernel.org>	2024-11-05 16:08:39 -08:00
Yu Kuai	ff31a7ef2b	md/raid1: don't wait for Faulty rdev in wait_blocked_rdev() Faulty rdev should never be accessed anymore, hence there is no point to wait for bad block to be acknowledged in this case while handling write request. Signed-off-by: Yu Kuai <yukuai3@huawei.com> Tested-by: Mariusz Tkaczyk <mariusz.tkaczyk@linux.intel.com> Link: https://lore.kernel.org/r/20241031033114.3845582-6-yukuai1@huaweicloud.com Signed-off-by: Song Liu <song@kernel.org>	2024-11-05 16:08:39 -08:00
Yu Kuai	88ed59c4cc	md/raid1: factor out helper to handle blocked rdev from raid1_write_request() Currently raid1 is preparing IO for underlying disks while checking if any disk is blocked, if so allocated resources must be released, then waiting for rdev to be unblocked and try to prepare IO again. Make code cleaner by checking blocked rdev first, it doesn't matter if rdev is blocked while issuing IO, the IO will wait for rdev to be unblocked or not. Signed-off-by: Yu Kuai <yukuai3@huawei.com> Tested-by: Mariusz Tkaczyk <mariusz.tkaczyk@linux.intel.com> Link: https://lore.kernel.org/r/20241031033114.3845582-5-yukuai1@huaweicloud.com Signed-off-by: Song Liu <song@kernel.org>	2024-11-05 16:08:39 -08:00
Yu Kuai	29967332ce	md: don't record new badblocks for faulty rdev Faulty will be checked before issuing IO to the rdev, however, rdev can be faulty at any time, hence it's possible that rdev_set_badblocks() will be called for faulty rdev. In this case, mddev->sb_flags will be set and some other path can be blocked by updating super block. Since faulty rdev will not be accesed anymore, there is no need to record new babblocks for faulty rdev and forcing updating super block. Noted this is not a bugfix, just prevent updating superblock in some corner cases, and will help to slice a bug related to external metadata[1], testing also shows that devices are removed faster in the case IO error. [1] https://lore.kernel.org/all/f34452df-810b-48b2-a9b4-7f925699a9e7@linux.intel.com/ Signed-off-by: Yu Kuai <yukuai3@huawei.com> Tested-by: Mariusz Tkaczyk <mariusz.tkaczyk@linux.intel.com> Link: https://lore.kernel.org/r/20241031033114.3845582-4-yukuai1@huaweicloud.com Signed-off-by: Song Liu <song@kernel.org>	2024-11-05 16:08:38 -08:00
Yu Kuai	50e8274855	md: don't wait faulty rdev in md_wait_for_blocked_rdev() md_wait_for_blocked_rdev() is called for write IO while rdev is blocked, howerver, rdev can be faulty after choosing this rdev to write, and faulty rdev should never be accessed anymore, hence there is no point to wait for faulty rdev to be unblocked. Signed-off-by: Yu Kuai <yukuai3@huawei.com> Tested-by: Mariusz Tkaczyk <mariusz.tkaczyk@linux.intel.com> Link: https://lore.kernel.org/r/20241031033114.3845582-3-yukuai1@huaweicloud.com Signed-off-by: Song Liu <song@kernel.org>	2024-11-05 16:08:38 -08:00
Yu Kuai	4abfce19c7	md: add a new helper rdev_blocked() The helper will be used in later patches for raid1/raid10/raid5, the difference is that Faulty rdev with unacknowledged bad block will not be considered blocked. Signed-off-by: Yu Kuai <yukuai3@huawei.com> Tested-by: Mariusz Tkaczyk <mariusz.tkaczyk@linux.intel.com> Link: https://lore.kernel.org/r/20241031033114.3845582-2-yukuai1@huaweicloud.com Signed-off-by: Song Liu <song@kernel.org>	2024-11-05 16:08:38 -08:00
Uros Bizjak	1e79892e76	md/raid5-ppl: Use atomic64_inc_return() in ppl_new_iounit() Use atomic64_inc_return(&ref) instead of atomic64_add_return(1, &ref) to use optimized implementation and ease register pressure around the primitive for targets that implement optimized variant. Signed-off-by: Uros Bizjak <ubizjak@gmail.com> Cc: Song Liu <song@kernel.org> Cc: Yu Kuai <yukuai3@huawei.com> Reviewed-by: Yu Kuai <yukuai3@huawei.com> Reviewed-by: Artur Paszkiewicz <artur.paszkiewicz@intel.com> Link: https://lore.kernel.org/r/20241007084831.48067-1-ubizjak@gmail.com Signed-off-by: Song Liu <song@kernel.org>	2024-11-05 16:08:38 -08:00
Christoph Hellwig	2a8f6153e1	block: pre-calculate max_zone_append_sectors max_zone_append_sectors differs from all other queue limits in that the final value used is not stored in the queue_limits but needs to be obtained using queue_limits_max_zone_append_sectors helper. This not only adds (tiny) extra overhead to the I/O path, but also can be easily forgotten in file system code. Add a new max_hw_zone_append_sectors value to queue_limits which is set by the driver, and calculate max_zone_append_sectors from that and the other inputs in blk_validate_zoned_limits, similar to how max_sectors is calculated to fix this. Signed-off-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20241104073955.112324-3-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>	2024-11-04 10:34:07 -07:00
Ming-Hung Tsai	c0ade5d989	dm cache: fix potential out-of-bounds access on the first resume Out-of-bounds access occurs if the fast device is expanded unexpectedly before the first-time resume of the cache table. This happens because expanding the fast device requires reloading the cache table for cache_create to allocate new in-core data structures that fit the new size, and the check in cache_preresume is not performed during the first resume, leading to the issue. Reproduce steps: 1. prepare component devices: dmsetup create cmeta --table "0 8192 linear /dev/sdc 0" dmsetup create cdata --table "0 65536 linear /dev/sdc 8192" dmsetup create corig --table "0 524288 linear /dev/sdc 262144" dd if=/dev/zero of=/dev/mapper/cmeta bs=4k count=1 oflag=direct 2. load a cache table of 512 cache blocks, and deliberately expand the fast device before resuming the cache, making the in-core data structures inadequate. dmsetup create cache --notable dmsetup reload cache --table "0 524288 cache /dev/mapper/cmeta \ /dev/mapper/cdata /dev/mapper/corig 128 2 metadata2 writethrough smq 0" dmsetup reload cdata --table "0 131072 linear /dev/sdc 8192" dmsetup resume cdata dmsetup resume cache 3. suspend the cache to write out the in-core dirty bitset and hint array, leading to out-of-bounds access to the dirty bitset at offset 0x40: dmsetup suspend cache KASAN reports: BUG: KASAN: vmalloc-out-of-bounds in is_dirty_callback+0x2b/0x80 Read of size 8 at addr ffffc90000085040 by task dmsetup/90 (...snip...) The buggy address belongs to the virtual mapping at [ffffc90000085000, ffffc90000087000) created by: cache_ctr+0x176a/0x35f0 (...snip...) Memory state around the buggy address: ffffc90000084f00: f8 f8 f8 f8 f8 f8 f8 f8 f8 f8 f8 f8 f8 f8 f8 f8 ffffc90000084f80: f8 f8 f8 f8 f8 f8 f8 f8 f8 f8 f8 f8 f8 f8 f8 f8 >ffffc90000085000: 00 00 00 00 00 00 00 00 f8 f8 f8 f8 f8 f8 f8 f8 ^ ffffc90000085080: f8 f8 f8 f8 f8 f8 f8 f8 f8 f8 f8 f8 f8 f8 f8 f8 ffffc90000085100: f8 f8 f8 f8 f8 f8 f8 f8 f8 f8 f8 f8 f8 f8 f8 f8 Fix by checking the size change on the first resume. Signed-off-by: Ming-Hung Tsai <mtsai@redhat.com> Fixes: `f494a9c6b1` ("dm cache: cache shrinking support") Cc: stable@vger.kernel.org Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Acked-by: Joe Thornber <thornber@redhat.com>	2024-11-04 17:39:31 +01:00
Ming-Hung Tsai	f484697e61	dm cache: optimize dirty bit checking with find_next_bit when resizing When shrinking the fast device, dm-cache iteratively searches for a dirty bit among the cache blocks to be dropped, which is less efficient. Use find_next_bit instead, as it is twice as fast as the iterative approach with test_bit. Signed-off-by: Ming-Hung Tsai <mtsai@redhat.com> Fixes: `f494a9c6b1` ("dm cache: cache shrinking support") Cc: stable@vger.kernel.org Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Acked-by: Joe Thornber <thornber@redhat.com>	2024-11-04 17:39:31 +01:00
Ming-Hung Tsai	7922277197	dm cache: fix out-of-bounds access to the dirty bitset when resizing dm-cache checks the dirty bits of the cache blocks to be dropped when shrinking the fast device, but an index bug in bitset iteration causes out-of-bounds access. Reproduce steps: 1. create a cache device of 1024 cache blocks (128 bytes dirty bitset) dmsetup create cmeta --table "0 8192 linear /dev/sdc 0" dmsetup create cdata --table "0 131072 linear /dev/sdc 8192" dmsetup create corig --table "0 524288 linear /dev/sdc 262144" dd if=/dev/zero of=/dev/mapper/cmeta bs=4k count=1 oflag=direct dmsetup create cache --table "0 524288 cache /dev/mapper/cmeta \ /dev/mapper/cdata /dev/mapper/corig 128 2 metadata2 writethrough smq 0" 2. shrink the fast device to 512 cache blocks, triggering out-of-bounds access to the dirty bitset (offset 0x80) dmsetup suspend cache dmsetup reload cdata --table "0 65536 linear /dev/sdc 8192" dmsetup resume cdata dmsetup resume cache KASAN reports: BUG: KASAN: vmalloc-out-of-bounds in cache_preresume+0x269/0x7b0 Read of size 8 at addr ffffc900000f3080 by task dmsetup/131 (...snip...) The buggy address belongs to the virtual mapping at [ffffc900000f3000, ffffc900000f5000) created by: cache_ctr+0x176a/0x35f0 (...snip...) Memory state around the buggy address: ffffc900000f2f80: f8 f8 f8 f8 f8 f8 f8 f8 f8 f8 f8 f8 f8 f8 f8 f8 ffffc900000f3000: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 >ffffc900000f3080: f8 f8 f8 f8 f8 f8 f8 f8 f8 f8 f8 f8 f8 f8 f8 f8 ^ ffffc900000f3100: f8 f8 f8 f8 f8 f8 f8 f8 f8 f8 f8 f8 f8 f8 f8 f8 ffffc900000f3180: f8 f8 f8 f8 f8 f8 f8 f8 f8 f8 f8 f8 f8 f8 f8 f8 Fix by making the index post-incremented. Signed-off-by: Ming-Hung Tsai <mtsai@redhat.com> Fixes: `f494a9c6b1` ("dm cache: cache shrinking support") Cc: stable@vger.kernel.org Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Acked-by: Joe Thornber <thornber@redhat.com>	2024-11-04 17:39:31 +01:00
Ming-Hung Tsai	135496c208	dm cache: fix flushing uninitialized delayed_work on cache_ctr error An unexpected WARN_ON from flush_work() may occur when cache creation fails, caused by destroying the uninitialized delayed_work waker in the error path of cache_create(). For example, the warning appears on the superblock checksum error. Reproduce steps: dmsetup create cmeta --table "0 8192 linear /dev/sdc 0" dmsetup create cdata --table "0 65536 linear /dev/sdc 8192" dmsetup create corig --table "0 524288 linear /dev/sdc 262144" dd if=/dev/urandom of=/dev/mapper/cmeta bs=4k count=1 oflag=direct dmsetup create cache --table "0 524288 cache /dev/mapper/cmeta \ /dev/mapper/cdata /dev/mapper/corig 128 2 metadata2 writethrough smq 0" Kernel logs: (snip) WARNING: CPU: 0 PID: 84 at kernel/workqueue.c:4178 __flush_work+0x5d4/0x890 Fix by pulling out the cancel_delayed_work_sync() from the constructor's error path. This patch doesn't affect the use-after-free fix for concurrent dm_resume and dm_destroy (commit `6a459d8edb` ("dm cache: Fix UAF in destroy()")) as cache_dtr is not changed. Signed-off-by: Ming-Hung Tsai <mtsai@redhat.com> Fixes: `6a459d8edb` ("dm cache: Fix UAF in destroy()") Cc: stable@vger.kernel.org Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Acked-by: Joe Thornber <thornber@redhat.com>	2024-11-04 17:39:31 +01:00
Ming-Hung Tsai	235d2e739f	dm cache: correct the number of origin blocks to match the target length When creating a cache device, the actual size of the cache origin might be greater than the specified cache target length. In such case, the number of origin blocks should match the cache target length, not the full size of the origin device, since access beyond the cache target is not possible. This issue occurs when reducing the origin device size using lvm, as lvreduce preloads the new cache table before resuming the cache origin, which can result in incorrect sizes for the discard bitset and smq hotspot blocks. Reproduce steps: 1. create a cache device consists of 4096 origin blocks dmsetup create cmeta --table "0 8192 linear /dev/sdc 0" dmsetup create cdata --table "0 65536 linear /dev/sdc 8192" dmsetup create corig --table "0 524288 linear /dev/sdc 262144" dd if=/dev/zero of=/dev/mapper/cmeta bs=4k count=1 oflag=direct dmsetup create cache --table "0 524288 cache /dev/mapper/cmeta \ /dev/mapper/cdata /dev/mapper/corig 128 2 metadata2 writethrough smq 0" 2. reduce the cache origin to 2048 oblocks, in lvreduce's approach dmsetup reload corig --table "0 262144 linear /dev/sdc 262144" dmsetup reload cache --table "0 262144 cache /dev/mapper/cmeta \ /dev/mapper/cdata /dev/mapper/corig 128 2 metadata2 writethrough smq 0" dmsetup suspend cache dmsetup suspend corig dmsetup suspend cdata dmsetup suspend cmeta dmsetup resume corig dmsetup resume cdata dmsetup resume cmeta dmsetup resume cache 3. shutdown the cache, and check the number of discard blocks in superblock. The value is expected to be 2048, but actually is 4096. dmsetup remove cache corig cdata cmeta dd if=/dev/sdc bs=1c count=8 skip=224 2>/dev/null \| hexdump -e '1/8 "%u\n"' Fix by correcting the origin_blocks initialization in cache_create and removing the unused origin_sectors from struct cache_args accordingly. Signed-off-by: Ming-Hung Tsai <mtsai@redhat.com> Fixes: `c6b4fcbad0` ("dm: add cache target") Cc: stable@vger.kernel.org Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Acked-by: Joe Thornber <thornber@redhat.com>	2024-11-04 17:39:31 +01:00
Mikulas Patocka	a674d0cd56	dm-verity: don't crash if panic_on_corruption is not selected If the user sets panic_on_error and doesn't set panic_on_corruption, dm-verity should not panic on data mismatch. But, currently it panics, because it treats data mismatch as I/O error. This commit fixes the logic so that if there is data mismatch and panic_on_corruption or restart_on_corruption is not selected, the system won't restart or panic. Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Reviewed-by: Sami Tolvanen <samitolvanen@google.com> Fixes: `f811b83879` ("dm-verity: introduce the options restart_on_error and panic_on_error")	2024-11-04 17:39:23 +01:00
Zichen Xie	5a4510c762	dm-unstriped: cast an operand to sector_t to prevent potential uint32_t overflow This was found by a static analyzer. There may be a potential integer overflow issue in unstripe_ctr(). uc->unstripe_offset and uc->unstripe_width are defined as "sector_t"(uint64_t), while uc->unstripe, uc->chunk_size and uc->stripes are all defined as "uint32_t". The result of the calculation will be limited to "uint32_t" without correct casting. So, we recommend adding an extra cast to prevent potential integer overflow. Fixes: `18a5bf2705` ("dm: add unstriped target") Signed-off-by: Zichen Xie <zichenxie0106@gmail.com> Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Cc: stable@vger.kernel.org	2024-11-04 17:34:56 +01:00
Christoph Hellwig	2f5a65ef30	block: add a bdev_limits helper Add a helper to get the queue_limits from the bdev without having to poke into the request_queue. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: John Garry <john.g.garry@oracle.com> Link: https://lore.kernel.org/r/20241029141937.249920-1-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>	2024-10-29 09:15:00 -06:00
Linus Torvalds	75f8b2f526	block-6.12-20241026 -----BEGIN PGP SIGNATURE----- iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmcc66sQHGF4Ym9lQGtl cm5lbC5kawAKCRD301j7KXHgpjplD/9/LUtySoj29m8jF4cTgnysGsuAjqcgNyU+ ykykZPQca+cWVhQzgFHob7N09C4y2gF2h/wokKM2cS8gaWzsSKvRaZBTeiQODJrf yAqG47BRXo6KJIpqT+A+FB0eDgRitCFweq5Is7Jh/rQooqJNvZb6W3hmK/eIfKxM BcY98/v02/eA/hry+IqAUzhoKHASxc/iFJJ8u+lk1fJyNZvQeIgzdy6RJwp/101L hCA1grIQRLJ86hhvbqrrMCmKfZeeuXKvx106YFhRlG0TCpPOGCeYMeqowdH5JlX6 inzt2NfciqncQmnKp8m3DCi2keT7AT+D1QX92JuTBAxa99qkaoqoC6b/EjbAIRpc 0cTR+G13LbyKlUuGMSRxa50EQtG4lkkIj3VlKAkxHPtEqy9y2+mK0JA33myYunTG wzOL8LKl0seLKtC8zHpcBZi5KNZt1MEu7GiibJVFdouje3X/VtDs00KymOboL7Uk W5YmpOSpLa1kh4U1FvdT0U1/xaV0Tb4UB3xjF0Qqhtqe1js1Vq86r5u/aiX3F3oZ 0emqwd/lMCGEzqRY7qeBN0zEj4LLXU/3Lxn6k+1LjX4exxjMaS5loZ6tPq5czxoC M5Qh2JmEP7zLx9hNg6QjOO+cCmLrG/oWCZRyxSsHguNeEgdEdKYZ8yDOPXtKuqkE Qc6YkxIOsQ== =zznR -----END PGP SIGNATURE----- Merge tag 'block-6.12-20241026' of git://git.kernel.dk/linux Pull block fixes from Jens Axboe: - Pull request for MD via Song fixing a few issues - Fix a wrong check in blk_rq_map_user_bvec(), causing IO errors on passthrough IO (Xinyu) * tag 'block-6.12-20241026' of git://git.kernel.dk/linux: block: fix sanity checks in blk_rq_map_user_bvec md/raid10: fix null ptr dereference in raid10_size() md: ensure child flush IO does not affect origin bio->bi_status	2024-10-27 08:29:36 -10:00
Yu Kuai	825711e001	md/raid10: fix null ptr dereference in raid10_size() In raid10_run() if raid10_set_queue_limits() succeed, the return value is set to zero, and if following procedures failed raid10_run() will return zero while mddev->private is still NULL, causing null ptr dereference in raid10_size(). Fix the problem by only overwrite the return value if raid10_set_queue_limits() failed. Fixes: `3d8466ba68` ("md/raid10: use the atomic queue limit update APIs") Cc: stable@vger.kernel.org Reported-and-tested-by: ValdikSS <iam@valdikss.org.ru> Closes: https://lore.kernel.org/all/0dd96820-fe52-4841-bc58-dbf14d6bfcc8@valdikss.org.ru/ Signed-off-by: Yu Kuai <yukuai3@huawei.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20241009014914.1682037-1-yukuai1@huaweicloud.com Signed-off-by: Song Liu <song@kernel.org>	2024-10-17 11:43:43 -07:00
Li Nan	62ce0782bb	md: ensure child flush IO does not affect origin bio->bi_status When a flush is issued to an RAID array, a child flush IO is created and issued for each member disk in the RAID array. Since commit `b75197e86e` ("md: Remove flush handling"), each child flush IO has been chained with the original bio. As a result, the failure of any child IO could modify the bi_status of the original bio, potentially impacting the upper-layer filesystem. Fix the issue by preventing child flush IO from altering the original bio->bi_status as before. However, this design introduces a known issue: in the event of a power failure, if a flush IO on a member disk fails, the upper layers may not be informed. This issue is not easy to fix and will not be addressed for the time being in this issue. Fixes: `b75197e86e` ("md: Remove flush handling") Signed-off-by: Li Nan <linan122@huawei.com> Reviewed-by: Yu Kuai <yukuai3@huawei.com> Link: https://lore.kernel.org/r/20240919063048.2887579-1-linan666@huaweicloud.com Signed-off-by: Song Liu <song@kernel.org>	2024-10-17 11:35:36 -07:00
Mikulas Patocka	fed13a5478	dm: fix a crash if blk_alloc_disk fails If blk_alloc_disk fails, the variable md->disk is set to an error value. cleanup_mapped_device will see that md->disk is non-NULL and it will attempt to access it, causing a crash on this statement "md->disk->private_data = NULL;". Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Reported-by: Chenyuan Yang <chenyuan0y@gmail.com> Closes: https://marc.info/?l=dm-devel&m=172824125004329&w=2 Cc: stable@vger.kernel.org Reviewed-by: Nitesh Shetty <nj.shetty@samsung.com>	2024-10-15 13:37:17 +02:00
Linus Torvalds	7ec462100e	Getting rid of asm/unaligned.h includes -----BEGIN PGP SIGNATURE----- iHUEABYIAB0WIQQqUNBr3gm4hGXdBJlZ7Krx/gZQ6wUCZv3NAgAKCRBZ7Krx/gZQ 68kbAP0YzQxUgl0/o7Soda8XwKSPZTM9ls6kRk1UHTTG/i4ZigEA/G+i/mBQctL0 AB911kK8mxfXppfOXzstFBjoJSqiigQ= =IE7D -----END PGP SIGNATURE----- Merge tag 'pull-work.unaligned' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs Pull generic unaligned.h cleanups from Al Viro: "Get rid of architecture-specific <asm/unaligned.h> includes, replacing them with a single generic <linux/unaligned.h> header file. It's the second largest (after asm/io.h) class of asm/* includes, and all but two architectures actually end up using exact same file. Massage the remaining two (arc and parisc) to do the same and just move the thing to from asm-generic/unaligned.h to linux/unaligned.h" [ This is one of those things that we're better off doing outside the merge window, and would only cause extra conflict noise if it was in linux-next for the next release due to all the trivial #include line updates. Rip off the band-aid. - Linus ] * tag 'pull-work.unaligned' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: move asm/unaligned.h to linux/unaligned.h arc: get rid of private asm/unaligned.h parisc: get rid of private asm/unaligned.h	2024-10-02 16:42:28 -07:00
Al Viro	5f60d5f6bb	move asm/unaligned.h to linux/unaligned.h asm/unaligned.h is always an include of asm-generic/unaligned.h; might as well move that thing to linux/unaligned.h and include that - there's nothing arch-specific in that header. auto-generated by the following: for i in `git grep -l -w asm/unaligned.h`; do sed -i -e "s/asm\/unaligned.h/linux\/unaligned.h/" $i done for i in `git grep -l -w asm-generic/unaligned.h`; do sed -i -e "s/asm-generic\/unaligned.h/linux\/unaligned.h/" $i done git mv include/asm-generic/unaligned.h include/linux/unaligned.h git mv tools/include/asm-generic/unaligned.h tools/include/linux/unaligned.h sed -i -e "/unaligned.h/d" include/asm-generic/Kbuild sed -i -e "s/__ASM_GENERIC/__LINUX/" include/linux/unaligned.h tools/include/linux/unaligned.h	2024-10-02 17:23:23 -04:00
Mikulas Patocka	f811b83879	dm-verity: introduce the options restart_on_error and panic_on_error This patch introduces the options restart_on_error and panic_on_error on dm-verity. Previously, restarting on error was handled by the patch `e6a3531dd5`, but Google engineers wanted to have a special option for it. Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Suggested-by: Sami Tolvanen <samitolvanen@google.com> Suggested-by: Will Drewry <wad@chromium.org>	2024-10-02 16:21:08 +02:00
Mikulas Patocka	462763212d	Revert: "dm-verity: restart or panic on an I/O error" This reverts commit `e6a3531dd5`. The problem that the commit `e6a3531dd5` fixes was reported as a security bug, but Google engineers working on Android and ChromeOS didn't want to change the default behavior, they want to get -EIO rather than restarting the system, so I am reverting that commit. Note also that calling machine_restart from the I/O handling code is potentially unsafe (the reboot notifiers may wait for the bio that triggered the restart), but Android uses the reboot notifiers to store the reboot reason into the PMU microcontroller, so machine_restart must be used. Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Cc: stable@vger.kernel.org Fixes: `e6a3531dd5` ("dm-verity: restart or panic on an I/O error") Suggested-by: Sami Tolvanen <samitolvanen@google.com> Suggested-by: Will Drewry <wad@chromium.org>	2024-10-02 16:15:34 +02:00
Linus Torvalds	e477dba544	- Misc VDO fixes - Remove unused declarations dm_get_rq_mapinfo() and dm_zone_map_bio() - Dm-delay: Improve kernel documentation - Dm-crypt: Allow to specify the integrity key size as an option - Dm-bufio: Remove pointless NULL check - Small code cleanups: Use ERR_CAST; remove unlikely() around IS_ERR; use __assign_bit - Dm-integrity: Fix gcc 5 warning; convert comma to semicolon; fix smatch warning - Dm-integrity: Support recalculation in the 'I' mode - Revert "dm: requeue IO if mapping table not yet available" - Dm-crypt: Small refactoring to make the code more readable - Dm-cache: Remove pointless error check - Dm: Fix spelling errors - Dm-verity: Restart or panic on an I/O error if restart or panic was requested - Dm-verity: Fallback to platform keyring also if key in trusted keyring is rejected -----BEGIN PGP SIGNATURE----- iIoEABYIADIWIQRnH8MwLyZDhyYfesYTAyx9YGnhbQUCZvapzRQcbXBhdG9ja2FA cmVkaGF0LmNvbQAKCRATAyx9YGnhbdKAAP4gHNU7aRmwTPcmvytEqBO4Pcz4eGB/ tytj2+o1orph3AD/YD2X75YHOrdNKTLq+N0ecetAt0yDVUnJAUtKiOnx6Q8= =0f9T -----END PGP SIGNATURE----- Merge tag 'for-6.12/dm-changes' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm Pull device mapper updates from Mikulas Patocka: - Misc VDO fixes - Remove unused declarations dm_get_rq_mapinfo() and dm_zone_map_bio() - Dm-delay: Improve kernel documentation - Dm-crypt: Allow to specify the integrity key size as an option - Dm-bufio: Remove pointless NULL check - Small code cleanups: Use ERR_CAST; remove unlikely() around IS_ERR; use __assign_bit - Dm-integrity: Fix gcc 5 warning; convert comma to semicolon; fix smatch warning - Dm-integrity: Support recalculation in the 'I' mode - Revert "dm: requeue IO if mapping table not yet available" - Dm-crypt: Small refactoring to make the code more readable - Dm-cache: Remove pointless error check - Dm: Fix spelling errors - Dm-verity: Restart or panic on an I/O error if restart or panic was requested - Dm-verity: Fallback to platform keyring also if key in trusted keyring is rejected * tag 'for-6.12/dm-changes' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm: (26 commits) dm verity: fallback to platform keyring also if key in trusted keyring is rejected dm-verity: restart or panic on an I/O error dm: fix spelling errors dm-cache: remove pointless error check dm vdo: handle unaligned discards correctly dm vdo indexer: Convert comma to semicolon dm-crypt: Use common error handling code in crypt_set_keyring_key() dm-crypt: Use up_read() together with key_put() only once in crypt_set_keyring_key() Revert "dm: requeue IO if mapping table not yet available" dm-integrity: check mac_size against HASH_MAX_DIGESTSIZE in sb_mac() dm-integrity: support recalculation in the 'I' mode dm integrity: Convert comma to semicolon dm integrity: fix gcc 5 warning dm: Make use of __assign_bit() API dm integrity: Remove extra unlikely helper dm: Convert to use ERR_CAST() dm bufio: Remove NULL check of list_entry() dm-crypt: Allow to specify the integrity key size as option dm: Remove unused declaration and empty definition "dm_zone_map_bio" dm delay: enhance kernel documentation ...	2024-09-27 09:12:51 -07:00
Luca Boccassi	579b2ba40e	dm verity: fallback to platform keyring also if key in trusted keyring is rejected If enabled, we fallback to the platform keyring if the trusted keyring doesn't have the key used to sign the roothash. But if pkcs7_verify() rejects the key for other reasons, such as usage restrictions, we do not fallback. Do so. Follow-up for `6fce1f40e9` Suggested-by: Serge Hallyn <serge@hallyn.com> Signed-off-by: Luca Boccassi <bluca@debian.org> Acked-by: Jarkko Sakkinen <jarkko@kernel.org> Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>	2024-09-26 17:27:08 +02:00
Mikulas Patocka	e6a3531dd5	dm-verity: restart or panic on an I/O error Maxim Suhanov reported that dm-verity doesn't crash if an I/O error happens. In theory, this could be used to subvert security, because an attacker can create sectors that return error with the Write Uncorrectable command. Some programs may misbehave if they have to deal with EIO. This commit fixes dm-verity, so that if "panic_on_corruption" or "restart_on_corruption" was specified and an I/O error happens, the machine will panic or restart. This commit also changes kernel_restart to emergency_restart - kernel_restart calls reboot notifiers and these reboot notifiers may wait for the bio that failed. emergency_restart doesn't call the notifiers. Reported-by: Maxim Suhanov <dfirblog@gmail.com> Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Cc: stable@vger.kernel.org	2024-09-26 17:27:07 +02:00
Shen Lichuan	0a92e5cdee	dm: fix spelling errors Fixed some confusing spelling errors that were currently identified, the details are as follows: -in the code comments: dm-cache-target.c: 1371: exclussive ==> exclusive dm-raid.c: 2522: repective ==> respective Signed-off-by: Shen Lichuan <shenlichuan@vivo.com> Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>	2024-09-26 17:27:07 +02:00
Dipendra Khadka	4feb014bc7	dm-cache: remove pointless error check Smatch reported following: ''' drivers/md/dm-cache-target.c:3204 parse_cblock_range() warn: sscanf doesn't return error codes drivers/md/dm-cache-target.c:3217 parse_cblock_range() warn: sscanf doesn't return error codes ''' Sscanf doesn't return negative values at all. Signed-off-by: Dipendra Khadka <kdipendra88@gmail.com> Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>	2024-09-26 17:26:56 +02:00
Matthew Sakai	66cac80698	dm vdo: handle unaligned discards correctly Reset the data_vio properly for each discard block, and delay acknowledgement and cleanup until all discard blocks are complete. Signed-off-by: Matthew Sakai <msakai@redhat.com> Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>	2024-09-23 15:15:41 +02:00
Shen Lichuan	9bcd923952	dm vdo indexer: Convert comma to semicolon To ensure code clarity and prevent potential errors, it's advisable to employ the ';' as a statement separator, except when ',' are intentionally used for specific purposes. Signed-off-by: Shen Lichuan <shenlichuan@vivo.com> Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>	2024-09-18 18:26:35 +02:00
Markus Elfring	5d49054ef6	dm-crypt: Use common error handling code in crypt_set_keyring_key() Add a jump target so that a bit of exception handling can be better reused at the end of this function implementation. Signed-off-by: Markus Elfring <elfring@users.sourceforge.net> Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>	2024-09-18 18:17:06 +02:00
Markus Elfring	c5391c0e04	dm-crypt: Use up_read() together with key_put() only once in crypt_set_keyring_key() The combination of the calls “up_read(&key->sem)” and “key_put(key)” was immediately used after a return code check for a set_key() call in this function implementation. Thus use such a function call pair only once instead directly before the check. This issue was transformed by using the Coccinelle software. Signed-off-by: Markus Elfring <elfring@users.sourceforge.net> Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>	2024-09-18 18:10:40 +02:00
Linus Torvalds	a430d95c5e	lsm/stable-6.12 PR 20240911 -----BEGIN PGP SIGNATURE----- iQJIBAABCAAyFiEES0KozwfymdVUl37v6iDy2pc3iXMFAmbiGGAUHHBhdWxAcGF1 bC1tb29yZS5jb20ACgkQ6iDy2pc3iXPU8BAA1+A15pmS34I9pq7c8TmRz3rNEs/a zrW1aWJ0X/+axNS7sW3Pwtt1EKuaOhskKU8gNSieRhljC8rgXIVjZzLw6Atgcr5k upulGbU9TXyVisYN+PWv9/84ito6/nYsKb7Mg3nUVsdodtIFVnsk1fxYLPHQEBig Pl3i26U3VqH93Kz0W5vs/QR2uduPB8ZyscdTgcbrY9Vv1Y7IDZ2g9QsJVKLvbQKL qcPK1JkHa+sBPJxDqS9A40zgbLbdPQgWQzsXX3dz822w1Ga7FIHSqxMBA6HwHZ+L kV4P58wVfavhwt/cQSKMWI/yiGPMMd0B6yD+m8ojOvGfOfRCWxGMmEMqHNuZ3m7k Bfll5ZgZTY8phUUhiNf3nxO3F3MM/5bHdhPOj3RReqbAbS6uWr4/fThPDYY/zIo6 NCY3HGxx3Ae64uQ01gC2p/czC50jDsMwlbXiZbrgdBhjBm/CVk5ozb80mLVcGrLB +6XMzzSbC8IaNAH2fDmUJ2ABdwyNPgsSOTGZVzIanpxu1SU2/yk3SMxkp8fv5s36 wLeODUVcLgsjVV538Mkm6PGTE4TlXaH9yi6apMyJAGp0vPYx5c3Xxk2y5A5cur5p hcrbDiX2QgeqFbwsz36incmPmbef2NU2c8feR8XLtPJuwNIeRcMSje0pnkaFlRmb TAUJ1sDQAzZ8Fy0= =HIAO -----END PGP SIGNATURE----- Merge tag 'lsm-pr-20240911' of git://git.kernel.org/pub/scm/linux/kernel/git/pcmoore/lsm Pull lsm updates from Paul Moore: - Move the LSM framework to static calls This transitions the vast majority of the LSM callbacks into static calls. Those callbacks which haven't been converted were left as-is due to the general ugliness of the changes required to support the static call conversion; we can revisit those callbacks at a future date. - Add the Integrity Policy Enforcement (IPE) LSM This adds a new LSM, Integrity Policy Enforcement (IPE). There is plenty of documentation about IPE in this patches, so I'll refrain from going into too much detail here, but the basic motivation behind IPE is to provide a mechanism such that administrators can restrict execution to only those binaries which come from integrity protected storage, e.g. a dm-verity protected filesystem. You will notice that IPE requires additional LSM hooks in the initramfs, dm-verity, and fs-verity code, with the associated patches carrying ACK/review tags from the associated maintainers. We couldn't find an obvious maintainer for the initramfs code, but the IPE patchset has been widely posted over several years. Both Deven Bowers and Fan Wu have contributed to IPE's development over the past several years, with Fan Wu agreeing to serve as the IPE maintainer moving forward. Once IPE is accepted into your tree, I'll start working with Fan to ensure he has the necessary accounts, keys, etc. so that he can start submitting IPE pull requests to you directly during the next merge window. - Move the lifecycle management of the LSM blobs to the LSM framework Management of the LSM blobs (the LSM state buffers attached to various kernel structs, typically via a void pointer named "security" or similar) has been mixed, some blobs were allocated/managed by individual LSMs, others were managed by the LSM framework itself. Starting with this pull we move management of all the LSM blobs, minus the XFRM blob, into the framework itself, improving consistency across LSMs, and reducing the amount of duplicated code across LSMs. Due to some additional work required to migrate the XFRM blob, it has been left as a todo item for a later date; from a practical standpoint this omission should have little impact as only SELinux provides a XFRM LSM implementation. - Fix problems with the LSM's handling of F_SETOWN The LSM hook for the fcntl(F_SETOWN) operation had a couple of problems: it was racy with itself, and it was disconnected from the associated DAC related logic in such a way that the LSM state could be updated in cases where the DAC state would not. We fix both of these problems by moving the security_file_set_fowner() hook into the same section of code where the DAC attributes are updated. Not only does this resolve the DAC/LSM synchronization issue, but as that code block is protected by a lock, it also resolve the race condition. - Fix potential problems with the security_inode_free() LSM hook Due to use of RCU to protect inodes and the placement of the LSM hook associated with freeing the inode, there is a bit of a challenge when it comes to managing any LSM state associated with an inode. The VFS folks are not open to relocating the LSM hook so we have to get creative when it comes to releasing an inode's LSM state. Traditionally we have used a single LSM callback within the hook that is triggered when the inode is "marked for death", but not actually released due to RCU. Unfortunately, this causes problems for LSMs which want to take an action when the inode's associated LSM state is actually released; so we add an additional LSM callback, inode_free_security_rcu(), that is called when the inode's LSM state is released in the RCU free callback. - Refactor two LSM hooks to better fit the LSM return value patterns The vast majority of the LSM hooks follow the "return 0 on success, negative values on failure" pattern, however, there are a small handful that have unique return value behaviors which has caused confusion in the past and makes it difficult for the BPF verifier to properly vet BPF LSM programs. This includes patches to convert two of these"special" LSM hooks to the common 0/-ERRNO pattern. - Various cleanups and improvements A handful of patches to remove redundant code, better leverage the IS_ERR_OR_NULL() helper, add missing "static" markings, and do some minor style fixups. * tag 'lsm-pr-20240911' of git://git.kernel.org/pub/scm/linux/kernel/git/pcmoore/lsm: (40 commits) security: Update file_set_fowner documentation fs: Fix file_set_fowner LSM hook inconsistencies lsm: Use IS_ERR_OR_NULL() helper function lsm: remove LSM_COUNT and LSM_CONFIG_COUNT ipe: Remove duplicated include in ipe.c lsm: replace indirect LSM hook calls with static calls lsm: count the LSMs enabled at compile time kernel: Add helper macros for loop unrolling init/main.c: Initialize early LSMs after arch code, static keys and calls. MAINTAINERS: add IPE entry with Fan Wu as maintainer documentation: add IPE documentation ipe: kunit test for parser scripts: add boot policy generation program ipe: enable support for fs-verity as a trust provider fsverity: expose verified fsverity built-in signatures to LSMs lsm: add security_inode_setintegrity() hook ipe: add support for dm-verity as a trust provider dm-verity: expose root hash digest and signature data to LSMs block,lsm: add LSM blob and new LSM hooks for block devices ipe: add permissive toggle ...	2024-09-16 18:19:47 +02:00
Linus Torvalds	26bb0d3f38	for-6.12/block-20240913 -----BEGIN PGP SIGNATURE----- iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmbkZhQQHGF4Ym9lQGtl cm5lbC5kawAKCRD301j7KXHgpjOKD/0fzd4yOcqxSI9W3OLGd04VrOTJIQa4CRbV GmoTq39pOeIDVGug5ekkTpqqHHnuGk+nQhCzD9vsN/eTmC7yZOIr847O2aWzvYEn PzFRgmJpoo2E9sr/IsTR5LnJjbaIZhQVkqLH6ZOj9tpKlVwN2SK0nIRVNrAi5zgT MaDrto/2OUld+vmA99Rgb23jxM6UBdCPIjuiVa+11Vg9Z3D1tWbBmrsG7OMysyIf FbASBeKHqFSO61/ipFCZv6VV1X8zoWEVyT8n4A1yUbbN5rLzPgoQJVbfSqQRXIdr cdrKeCbKxl+joSgKS6LKpvnfwRgGF+hgAfpZg4c0vrbZGTQcRhhLFECyh/aVI08F p5TOMArhVaX59664gHgSPq4KnGTXOO29dot9N3Jya/ZQnxinjY9r+GVOfLuduPPy 1B04vab8oAsk4zK7fZbkDxgYUyifwzK/vQ6OqYq2mYdpdIS/AE7T2ou61Bz5mI7I /BuucNV0Z96OKlyLEXwXXZjZgNu1TFcq6ARIBJ8L08PY64Fesj5BXabRyXkeNH26 0exyz9heeJs6OwRGfngXmS24tDSS0k74CeZX3KoePNj69u6KCn346KiU1qgntwwD E5F7AEHqCl5FjUEIWB4M1EPlfA8U0MzOL+tkx2xKJAjsU60wAy7jRSyOIcqodpMs 6UlPcJzgYg== =uuLl -----END PGP SIGNATURE----- Merge tag 'for-6.12/block-20240913' of git://git.kernel.dk/linux Pull block updates from Jens Axboe: - MD changes via Song: - md-bitmap refactoring (Yu Kuai) - raid5 performance optimization (Artur Paszkiewicz) - Other small fixes (Yu Kuai, Chen Ni) - Add a sysfs entry 'new_level' (Xiao Ni) - Improve information reported in /proc/mdstat (Mateusz Kusiak) - NVMe changes via Keith: - Asynchronous namespace scanning (Stuart) - TCP TLS updates (Hannes) - RDMA queue controller validation (Niklas) - Align field names to the spec (Anuj) - Metadata support validation (Puranjay) - A syntax cleanup (Shen) - Fix a Kconfig linking error (Arnd) - New queue-depth quirk (Keith) - Add missing unplug trace event (Keith) - blk-iocost fixes (Colin, Konstantin) - t10-pi modular removal and fixes (Alexey) - Fix for potential BLKSECDISCARD overflow (Alexey) - bio splitting cleanups and fixes (Christoph) - Deal with folios rather than rather than pages, speeding up how the block layer handles bigger IOs (Kundan) - Use spinlocks rather than bit spinlocks in zram (Sebastian, Mike) - Reduce zoned device overhead in ublk (Ming) - Add and use sendpages_ok() for drbd and nvme-tcp (Ofir) - Fix regression in partition error pointer checking (Riyan) - Add support for write zeroes and rotational status in nbd (Wouter) - Add Yu Kuai as new BFQ maintainer. The scheduler has been unmaintained for quite a while. - Various sets of fixes for BFQ (Yu Kuai) - Misc fixes and cleanups (Alvaro, Christophe, Li, Md Haris, Mikhail, Yang) * tag 'for-6.12/block-20240913' of git://git.kernel.dk/linux: (120 commits) nvme-pci: qdepth 1 quirk block: fix potential invalid pointer dereference in blk_add_partition blk_iocost: make read-only static array vrate_adj_pct const block: unpin user pages belonging to a folio at once mm: release number of pages of a folio block: introduce folio awareness and add a bigger size from folio block: Added folio-ized version of bio_add_hw_page() block, bfq: factor out a helper to split bfqq in bfq_init_rq() block, bfq: remove local variable 'bfqq_already_existing' in bfq_init_rq() block, bfq: remove local variable 'split' in bfq_init_rq() block, bfq: remove bfq_log_bfqg() block, bfq: merge bfq_release_process_ref() into bfq_put_cooperator() block, bfq: fix procress reference leakage for bfqq in merge chain block, bfq: fix uaf for accessing waker_bfqq after splitting blk-throttle: support prioritized processing of metadata blk-throttle: remove last_low_overflow_time drbd: Add NULL check for net_conf to prevent dereference in state validation nvme-tcp: fix link failure for TCP auth blk-mq: add missing unplug trace event mtip32xx: Remove redundant null pointer checks in mtip_hw_debugfs_init() ...	2024-09-16 13:33:06 +02:00
Linus Torvalds	8f72c31f45	vfs-6.12.misc -----BEGIN PGP SIGNATURE----- iHUEABYKAB0WIQRAhzRXHqcMeLMyaSiRxhvAZXjcogUCZuQEGwAKCRCRxhvAZXjc ojIuAQC433+hBkvjvmQ7H0r5rgZSjUuCTG3bSmdU7RJmPHUHhwEA85v/NGq53f+W IhandK6t+Cf0JYpFZ3N0bT88hDYVhQQ= =9zGL -----END PGP SIGNATURE----- Merge tag 'vfs-6.12.misc' of gitolite.kernel.org:pub/scm/linux/kernel/git/vfs/vfs Pull misc vfs updates from Christian Brauner: "This contains the usual pile of misc updates: Features: - Add F_CREATED_QUERY fcntl() that allows userspace to query whether a file was actually created. Often userspace wants to know whether an O_CREATE request did actually create a file without using O_EXCL. The current logic is that to first attempts to open the file without O_CREAT \| O_EXCL and if ENOENT is returned userspace tries again with both flags. If that succeeds all is well. If it now reports EEXIST it retries. That works fairly well but some corner cases make this more involved. If this operates on a dangling symlink the first openat() without O_CREAT \| O_EXCL will return ENOENT but the second openat() with O_CREAT \| O_EXCL will fail with EEXIST. The reason is that openat() without O_CREAT \| O_EXCL follows the symlink while O_CREAT \| O_EXCL doesn't for security reasons. So it's not something we can really change unless we add an explicit opt-in via O_FOLLOW which seems really ugly. All available workarounds are really nasty (fanotify, bpf lsm etc) so add a simple fcntl(). - Try an opportunistic lookup for O_CREAT. Today, when opening a file we'll typically do a fast lookup, but if O_CREAT is set, the kernel always takes the exclusive inode lock. This was likely done with the expectation that O_CREAT means that we always expect to do the create, but that's often not the case. Many programs set O_CREAT even in scenarios where the file already exists (see related F_CREATED_QUERY patch motivation above). The series contained in the pr rearranges the pathwalk-for-open code to also attempt a fast_lookup in certain O_CREAT cases. If a positive dentry is found, the inode_lock can be avoided altogether and it can stay in rcuwalk mode for the last step_into. - Expose the 64 bit mount id via name_to_handle_at() Now that we provide a unique 64-bit mount ID interface in statx(2), we can now provide a race-free way for name_to_handle_at(2) to provide a file handle and corresponding mount without needing to worry about racing with /proc/mountinfo parsing or having to open a file just to do statx(2). While this is not necessary if you are using AT_EMPTY_PATH and don't care about an extra statx(2) call, users that pass full paths into name_to_handle_at(2) need to know which mount the file handle comes from (to make sure they don't try to open_by_handle_at a file handle from a different filesystem) and switching to AT_EMPTY_PATH would require allocating a file for every name_to_handle_at(2) call - Add a per dentry expire timeout to autofs There are two fairly well known automounter map formats, the autofs format and the amd format (more or less System V and Berkley). Some time ago Linux autofs added an amd map format parser that implemented a fair amount of the amd functionality. This was done within the autofs infrastructure and some functionality wasn't implemented because it either didn't make sense or required extra kernel changes. The idea was to restrict changes to be within the existing autofs functionality as much as possible and leave changes with a wider scope to be considered later. One of these changes is implementing the amd options: 1) "unmount", expire this mount according to a timeout (same as the current autofs default). 2) "nounmount", don't expire this mount (same as setting the autofs timeout to 0 except only for this specific mount) . 3) "utimeout=<seconds>", expire this mount using the specified timeout (again same as setting the autofs timeout but only for this mount) To implement these options per-dentry expire timeouts need to be implemented for autofs indirect mounts. This is because all map keys (mounts) for autofs indirect mounts use an expire timeout stored in the autofs mount super block info. structure and all indirect mounts use the same expire timeout. Fixes: - Fix missing fput for FSCONFIG_SET_FD in autofs - Use param->file for FSCONFIG_SET_FD in coda - Delete the 'fs/netfs' proc subtreee when netfs module exits - Make sure that struct uid_gid_map fits into a single cacheline - Don't flush in-flight wb switches for superblocks without cgroup writeback - Correcting the idmapping mount example in the idmapping documentation - Fix a race between evice_inodes() and find_inode() and iput() - Refine the show_inode_state() macro definition in writeback code - Prevent dump_mapping() from accessing invalid dentry.d_name.name - Show actual source for debugfs in /proc/mounts - Annotate data-race of busy_poll_usecs in eventpoll - Don't WARN for racy path_noexec check in exec code - Handle OOM on mnt_warn_timestamp_expiry() - Fix some spelling in the iomap design documentation - Fix typo in procfs comment - Fix typo in fs/namespace.c comment Cleanups: - Add the VFS git tree to the MAINTAINERS file - Move FMODE_UNSIGNED_OFFSET to fop_flags freeing up another f_mode bit in struct file bringing us to 5 free f_mode bits - Remove the __I_DIO_WAKEUP bit from i_state flags as we can simplify the wait mechanism - Remove the unused path_put_init() helper - Replace a __u32 with u32 for s_fsnotify_mask as __u32 is uapi specific - Replace the unsigned long i_state member with a u32 i_state member in struct inode freeing up 4 bytes in struct inode. Instead of using the bit based wait apis we're now using the var event apis and using the individual bytes of the i_state member to wait on state changes - Explain how per-syscall AT_* flags should be allocated - Use in_group_or_capable() helper to simplify the posix acl mode update code - Switch to LIST_HEAD() in fsync_buffers_list() to simplify the code - Removed comment about d_rcu_to_refcount() as that function doesn't exist anymore - Add kernel documentation for lookup_fast() - Don't re-zero evenpoll fields - Remove outdated comment after close_fd() - Fix imprecise wording in comment about the pipe filesystem - Drop GFP_NOFAIL mode from alloc_page_buffers - Missing blank line warnings and struct declaration improved in file_table - Annotate struct poll_list with __counted_by() - Remove the unused read parameter in percpu-rwsem - Remove linux/prefetch.h include from direct-io code - Use kmemdup_array instead of kmemdup for multiple allocation in mnt_idmapping code - Remove unused mnt_cursor_del() declaration Performance tweaks: - Dodge smp_mb in break_lease and break_deleg in the common case - Only read fops once in fops_{get,put}() - Use RCU in ilookup() - Elide smp_mb in iversion handling in the common case - Drop one lock trip in evict()" * tag 'vfs-6.12.misc' of gitolite.kernel.org:pub/scm/linux/kernel/git/vfs/vfs: (58 commits) uidgid: make sure we fit into one cacheline proc: Fix typo in the comment fs/pipe: Correct imprecise wording in comment fhandle: expose u64 mount id to name_to_handle_at(2) uapi: explain how per-syscall AT_* flags should be allocated fs: drop GFP_NOFAIL mode from alloc_page_buffers writeback: Refine the show_inode_state() macro definition fs/inode: Prevent dump_mapping() accessing invalid dentry.d_name.name mnt_idmapping: Use kmemdup_array instead of kmemdup for multiple allocation netfs: Delete subtree of 'fs/netfs' when netfs module exits fs: use LIST_HEAD() to simplify code inode: make i_state a u32 inode: port __I_LRU_ISOLATING to var event vfs: fix race between evice_inodes() and find_inode()&iput() inode: port __I_NEW to var event inode: port __I_SYNC to var event fs: reorder i_state bits fs: add i_state helpers MAINTAINERS: add the VFS git tree fs: s/__u32/u32/ for s_fsnotify_mask ...	2024-09-16 08:35:09 +02:00
Mikulas Patocka	c8691cd0fc	Revert "dm: requeue IO if mapping table not yet available" This reverts commit `fa247089de`. The following sequence of commands causes a livelock - there will be workqueue process looping and consuming 100% CPU: dmsetup create --notable test truncate -s 1MiB testdata losetup /dev/loop0 testdata dmsetup load test --table '0 2048 linear /dev/loop0 0' dd if=/dev/zero of=/dev/dm-0 bs=16k count=1 conv=fdatasync The livelock is caused by the commit `fa247089de`. The commit claims that it fixes a race condition, however, it is unknown what the actual race condition is and what program is involved in the race condition. When the inactive table is loaded, the nodes /dev/dm-0 and /sys/block/dm-0 are created. /dev/dm-0 has zero size at this point. When the device is suspended and resumed, the nodes /dev/mapper/test and /dev/disk/* are created. If some program opens a block device before it is created by dmsetup or lvm, the program is buggy, so dm could just report an error as it used to do before. Reported-by: Zdenek Kabelac <zkabelac@redhat.com> Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Fixes: `fa247089de` ("dm: requeue IO if mapping table not yet available")	2024-09-15 21:02:54 +02:00
Linus Torvalds	3857c7b041	- fix a race condition in dm-integrity -----BEGIN PGP SIGNATURE----- iIoEABYIADIWIQRnH8MwLyZDhyYfesYTAyx9YGnhbQUCZuGERRQcbXBhdG9ja2FA cmVkaGF0LmNvbQAKCRATAyx9YGnhbQDOAQDP5McFAyla+jgWuuZWr3yL6sOibQuN mHAIwFsiKhdEgwEA5lfHaI6gVCycoiGpmD5EhL4sZSgA4B3e3GXuHb/lRQ8= =wFGJ -----END PGP SIGNATURE----- Merge tag 'for-6.11/dm-fixes-2' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm Pull device mapper fix from Mikulas Patocka: - fix a race condition in dm-integrity * tag 'for-6.11/dm-fixes-2' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm: dm-integrity: fix a race condition when accessing recalc_sector	2024-09-11 11:21:50 -07:00
Eric Biggers	9c2010bccc	dm-integrity: check mac_size against HASH_MAX_DIGESTSIZE in sb_mac() sb_mac() verifies that the superblock + MAC don't exceed 512 bytes. Because the superblock is currently 64 bytes, this really verifies mac_size <= 448. This confuses smatch into thinking that mac_size may be as large as 448, which is inconsistent with the later code that assumes the MAC fits in a buffer of size HASH_MAX_DIGESTSIZE (64). In fact mac_size <= HASH_MAX_DIGESTSIZE is guaranteed by the crypto API, as that is the whole point of HASH_MAX_DIGESTSIZE. But, let's be defensive and explicitly check for this. This suppresses the false positive smatch warning. It does not fix an actual bug. Reported-by: kernel test robot <lkp@intel.com> Reported-by: Dan Carpenter <dan.carpenter@linaro.org> Closes: https://lore.kernel.org/r/202409061401.44rtN1bh-lkp@intel.com/ Signed-off-by: Eric Biggers <ebiggers@google.com> Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>	2024-09-11 14:04:41 +02:00
Xiao Ni	d981ed8419	md: Add new_level sysfs interface Now reshape supports two ways: with backup file or without backup file. For the situation without backup file, it needs to change data offset. It doesn't need systemd service mdadm-grow-continue. So it can finish the reshape job in one process environment. It can know the new level from mdadm --grow command and can change to new level after reshape finishes. For the situation with backup file, it needs systemd service mdadm-grow-continue to monitor reshape progress. So there are two process envolved. One is mdadm --grow command whick kicks off reshape and wakes up mdadm-grow-continue service. The second process is the service, which doesn't know the new level from the first process. In kernel space mddev->new_level is used to record the new level when doing reshape. This patch adds a new interface to help mdadm update new_level and sync it to metadata. Then mdadm-grow-continue can read the right new_level. Commit log revised by Song Liu. Please refer to the link for more details. Signed-off-by: Xiao Ni <xni@redhat.com> Link: https://lore.kernel.org/r/20240904235453.99120-1-xni@redhat.com Signed-off-by: Song Liu <song@kernel.org>	2024-09-06 10:31:12 -07:00
Mikulas Patocka	f8e1ca92e3	dm-integrity: fix a race condition when accessing recalc_sector There's a race condition when accessing the variable ic->sb->recalc_sector. The function integrity_recalc writes to this variable when it makes some progress and the function dm_integrity_map_continue may read this variable concurrently. One problem is that on 32-bit architectures the 64-bit variable is not read and written atomically - it may be possible to read garbage if read races with write. Another problem is that memory accesses to this variable are not guarded with memory barriers. This commit fixes the race - it moves reading ic->sb->recalc_sector to an earlier place where we hold &ic->endio_wait.lock. Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Cc: stable@vger.kernel.org	2024-09-06 12:38:16 +02:00
Mikulas Patocka	90da77987d	dm-integrity: support recalculation in the 'I' mode In the kernel 6.11, dm-integrity was enhanced with an inline ('I') mode. This mode uses devices with non-power-of-2 sector size. The extra metadata after each sector are used to hold the integrity hash. This commit enhances the inline mode, so that there is automatic recalculation of the integrity hashes when the 'reclaculate' parameter is used. It allows us to activate the device instantly, and the recalculation is done on background. If the device is deactivated while recalculation is in progress, it will remember the point where it stopped and it will continue from this point when activated again. Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>	2024-09-06 12:34:05 +02:00
Chen Ni	a7722b82c2	dm integrity: Convert comma to semicolon Replace comma between expressions with semicolons. Using a ',' in place of a ';' can have unintended side effects. Although that is not the case here, it is seems best to use ';' unless ',' is intended. Found by inspection. No functional change intended. Compile tested only. Signed-off-by: Chen Ni <nichen@iscas.ac.cn> Reviewed-by: Mike Snitzer <snitzer@kernel.org> Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>	2024-09-06 12:33:42 +02:00
Mateusz Kusiak	2d2b3bc145	md: Report failed arrays as broken in mdstat Depending on if array has personality, it is either reported as active or inactive. This patch adds third status "broken" for arrays with personality that became inoperative. The reason is end users tend to assume that "active" indicates array is operational. Add "broken" state for inoperative arrays with personality and refactor the code. Signed-off-by: Mateusz Kusiak <mateusz.kusiak@intel.com> Link: https://lore.kernel.org/r/20240903142949.53628-1-mateusz.kusiak@intel.com Signed-off-by: Song Liu <song@kernel.org>	2024-09-04 14:52:45 -07:00
Mikulas Patocka	a8fa6483b4	dm integrity: fix gcc 5 warning This commit fixes gcc 5 warning "logical not is only applied to the left hand side of comparison" Reported-by: Geert Uytterhoeven <geert@linux-m68k.org> Fixes: `fb0987682c` ("dm-integrity: introduce the Inline mode") Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>	2024-09-03 12:08:20 +02:00
Hongbo Li	26207c6332	dm: Make use of __assign_bit() API We have for some time the __assign_bit() API to replace open coded if (foo) __set_bit(n, bar); else __clear_bit(n, bar); Use this API to simplify the code. No functional change intended. Signed-off-by: Hongbo Li <lihongbo22@huawei.com> Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>	2024-09-02 16:53:53 +02:00
Hongbo Li	35c9f09b56	dm integrity: Remove extra unlikely helper In IS_ERR, the unlikely is used for the input parameter, so these is no need to use it again outside. Signed-off-by: Hongbo Li <lihongbo22@huawei.com> Signed-off-by: Kunwu Chan <chentao@kylinos.cn> Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>	2024-09-02 11:41:11 +02:00
Yuesong Li	00565cff01	dm: Convert to use ERR_CAST() Use ERR_CAST() as it is designed for casting an error pointer to another type. This macro utilizes the __force and __must_check modifiers, which instruct the compiler to verify for errors at the locations where it is employed. Signed-off-by: Yuesong Li <liyuesong@vivo.com> Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>	2024-09-02 11:41:11 +02:00
Michal Hocko	5c40e050e6	fs: drop GFP_NOFAIL mode from alloc_page_buffers There is only one called of alloc_page_buffers and it doesn't require __GFP_NOFAIL so drop this allocation mode. Signed-off-by: Michal Hocko <mhocko@suse.com> Link: https://lore.kernel.org/r/20240829130640.1397970-1-mhocko@kernel.org Acked-by: Song Liu <song@kernel.org> Reviewed-by: Jan Kara <jack@suse.cz> Signed-off-by: Christian Brauner <brauner@kernel.org>	2024-08-30 14:54:03 +02:00
Artur Paszkiewicz	6f039cc42f	md/raid5: rename wait_for_overlap to wait_for_reshape The only remaining uses of wait_for_overlap are related to reshape so rename it accordingly. Signed-off-by: Artur Paszkiewicz <artur.paszkiewicz@intel.com> Link: https://lore.kernel.org/r/20240827153536.6743-4-artur.paszkiewicz@intel.com Signed-off-by: Song Liu <song@kernel.org>	2024-08-29 09:37:10 -07:00
Artur Paszkiewicz	0e4aac7366	md/raid5: only add to wq if reshape is in progress Now that actual overlaps are not handled on the wait_for_overlap wq anymore, the remaining cases when we wait on this wq are limited to reshape. If reshape is not in progress, don't add to the wq in raid5_make_request() because add_wait_queue() / remove_wait_queue() operations take a spinlock and cause noticeable contention when multiple threads are submitting requests to the mddev. Signed-off-by: Artur Paszkiewicz <artur.paszkiewicz@intel.com> Link: https://lore.kernel.org/r/20240827153536.6743-3-artur.paszkiewicz@intel.com Signed-off-by: Song Liu <song@kernel.org>	2024-08-29 09:37:10 -07:00
Artur Paszkiewicz	e6a03207b9	md/raid5: use wait_on_bit() for R5_Overlap Convert uses of wait_for_overlap wait queue with R5_Overlap bit to wait_on_bit() / wake_up_bit(). Signed-off-by: Artur Paszkiewicz <artur.paszkiewicz@intel.com> Link: https://lore.kernel.org/r/20240827153536.6743-2-artur.paszkiewicz@intel.com Signed-off-by: Song Liu <song@kernel.org>	2024-08-29 09:37:10 -07:00
Song Liu	7f67fdae33	Merge branch 'md-6.12-bitmap' into md-6.12 From Yu Kuai (with minor changes by Song Liu): The background is that currently bitmap is using a global spin_lock, causing lock contention and huge IO performance degradation for all raid levels. However, it's impossible to implement a new lock free bitmap with current situation that md-bitmap exposes the internal implementation with lots of exported apis. Hence bitmap_operations is invented, to describe bitmap core implementation, and a new bitmap can be introduced with a new bitmap_operations, we only need to switch to the new one during initialization. And with this we can build bitmap as kernel module, but that's not our concern for now. This version was tested with mdadm tests and lvm2 tests. This set does not introduce new errors in these tests. * md-6.12-bitmap: (42 commits) md/md-bitmap: make in memory structure internal md/md-bitmap: merge md_bitmap_enabled() into bitmap_operations md/md-bitmap: merge md_bitmap_wait_behind_writes() into bitmap_operations md/md-bitmap: merge md_bitmap_free() into bitmap_operations md/md-bitmap: merge md_bitmap_set_pages() into struct bitmap_operations md/md-bitmap: merge md_bitmap_copy_from_slot() into struct bitmap_operation. md/md-bitmap: merge get_bitmap_from_slot() into bitmap_operations md/md-bitmap: merge md_bitmap_resize() into bitmap_operations md/md-bitmap: pass in mddev directly for md_bitmap_resize() md/md-bitmap: merge md_bitmap_daemon_work() into bitmap_operations md/md-bitmap: merge bitmap_unplug() into bitmap_operations md/md-bitmap: merge md_bitmap_unplug_async() into md_bitmap_unplug() md/md-bitmap: merge md_bitmap_sync_with_cluster() into bitmap_operations md/md-bitmap: merge md_bitmap_cond_end_sync() into bitmap_operations md/md-bitmap: merge md_bitmap_close_sync() into bitmap_operations md/md-bitmap: merge md_bitmap_end_sync() into bitmap_operations md/md-bitmap: remove the parameter 'aborted' for md_bitmap_end_sync() md/md-bitmap: merge md_bitmap_start_sync() into bitmap_operations md/md-bitmap: merge md_bitmap_endwrite() into bitmap_operations md/md-bitmap: merge md_bitmap_startwrite() into bitmap_operations ... Signed-off-by: Song Liu <song@kernel.org>	2024-08-28 14:55:57 -07:00
Yu Kuai	b75197e86e	md: Remove flush handling For flush request, md has a special flush handling to merge concurrent flush request into single one, however, the whole mechanism is based on a disk level spin_lock 'mddev->lock'. And fsync can be called quite often in some user cases, for consequence, spin lock from IO fast path can cause performance degradation. Fortunately, the block layer already has flush handling to merge concurrent flush request, and it only acquires hctx level spin lock. (see details in blk-flush.c) This patch removes the flush handling in md, and converts to use general block layer flush handling in underlying disks. Flush test for 4 nvme raid10: start 128 threads to do fsync 100000 times, on arm64, see how long it takes. Test script: void* thread_func(void* arg) { int fd = (int)arg; for (int i = 0; i < FSYNC_COUNT; i++) { fsync(fd); } return NULL; } int main() { int fd = open("/dev/md0", O_RDWR); if (fd < 0) { perror("open"); exit(1); } pthread_t threads[THREADS]; struct timeval start, end; gettimeofday(&start, NULL); for (int i = 0; i < THREADS; i++) { pthread_create(&threads[i], NULL, thread_func, &fd); } for (int i = 0; i < THREADS; i++) { pthread_join(threads[i], NULL); } gettimeofday(&end, NULL); close(fd); long long elapsed = (end.tv_sec - start.tv_sec) * 1000000LL + (end.tv_usec - start.tv_usec); printf("Elapsed time: %lld microseconds\n", elapsed); return 0; } Test result: about 10 times faster: Before this patch: 50943374 microseconds After this patch: `5096347` microseconds Signed-off-by: Yu Kuai <yukuai3@huawei.com> Link: https://lore.kernel.org/r/20240827110616.3860190-1-yukuai1@huaweicloud.com Signed-off-by: Song Liu <song@kernel.org>	2024-08-27 17:19:55 -07:00
Yu Kuai	59fdd43304	md/md-bitmap: make in memory structure internal Now that struct bitmap_page and bitmap is not used externally anymore, move them from md-bitmap.h to md-bitmap.c (expect that dm-raid is still using define marco 'COUNTER_MAX'). Also fix some checkpatch warnings. Signed-off-by: Yu Kuai <yukuai3@huawei.com> Link: https://lore.kernel.org/r/20240826074452.1490072-43-yukuai1@huaweicloud.com Signed-off-by: Song Liu <song@kernel.org>	2024-08-27 12:43:16 -07:00
Yu Kuai	dab2ce5534	md/md-bitmap: merge md_bitmap_enabled() into bitmap_operations So that the implementation won't be exposed, and it'll be possible to invent a new bitmap by replacing bitmap_operations. Signed-off-by: Yu Kuai <yukuai3@huawei.com> Link: https://lore.kernel.org/r/20240826074452.1490072-42-yukuai1@huaweicloud.com Signed-off-by: Song Liu <song@kernel.org>	2024-08-27 12:43:16 -07:00
Yu Kuai	49f5f5e309	md/md-bitmap: merge md_bitmap_wait_behind_writes() into bitmap_operations So that the implementation won't be exposed, and it'll be possible to invent a new bitmap by replacing bitmap_operations. Signed-off-by: Yu Kuai <yukuai3@huawei.com> Link: https://lore.kernel.org/r/20240826074452.1490072-41-yukuai1@huaweicloud.com Signed-off-by: Song Liu <song@kernel.org>	2024-08-27 12:43:16 -07:00
Yu Kuai	c65c20dc50	md/md-bitmap: merge md_bitmap_free() into bitmap_operations So that the implementation won't be exposed, and it'll be possible o invent a new bitmap by replacing bitmap_operations. Signed-off-by: Yu Kuai <yukuai3@huawei.com> Link: https://lore.kernel.org/r/20240826074452.1490072-40-yukuai1@huaweicloud.com Signed-off-by: Song Liu <song@kernel.org>	2024-08-27 12:43:15 -07:00
Yu Kuai	ef1c400faf	md/md-bitmap: merge md_bitmap_set_pages() into struct bitmap_operations o that the implementation won't be exposed, and it'll be possible o invent a new bitmap by replacing bitmap_operations. Signed-off-by: Yu Kuai <yukuai3@huawei.com> Link: https://lore.kernel.org/r/20240826074452.1490072-39-yukuai1@huaweicloud.com Signed-off-by: Song Liu <song@kernel.org>	2024-08-27 12:43:15 -07:00
Yu Kuai	3dd9141a15	md/md-bitmap: merge md_bitmap_copy_from_slot() into struct bitmap_operation. So that the implementation won't be exposed, and it'll be possible to invent a new bitmap by replacing bitmap_operations. Signed-off-by: Yu Kuai <yukuai3@huawei.com> Link: https://lore.kernel.org/r/20240826074452.1490072-38-yukuai1@huaweicloud.com Signed-off-by: Song Liu <song@kernel.org>	2024-08-27 12:43:15 -07:00
Yu Kuai	57d602414d	md/md-bitmap: merge get_bitmap_from_slot() into bitmap_operations So that the implementation won't be exposed, and it'll be possible to invent a new bitmap by replacing bitmap_operations. Signed-off-by: Yu Kuai <yukuai3@huawei.com> Link: https://lore.kernel.org/r/20240826074452.1490072-37-yukuai1@huaweicloud.com Signed-off-by: Song Liu <song@kernel.org>	2024-08-27 12:43:15 -07:00
Yu Kuai	77c09640ee	md/md-bitmap: merge md_bitmap_resize() into bitmap_operations So that the implementation won't be exposed, and it'll be possible to invent a new bitmap by replacing bitmap_operations. Signed-off-by: Yu Kuai <yukuai3@huawei.com> Link: https://lore.kernel.org/r/20240826074452.1490072-36-yukuai1@huaweicloud.com Signed-off-by: Song Liu <song@kernel.org>	2024-08-27 12:43:15 -07:00
Yu Kuai	e1791dae6c	md/md-bitmap: pass in mddev directly for md_bitmap_resize() And move the condition "if (mddev->bitmap)" into md_bitmap_resize() as well, on the one hand make code cleaner, on the other hand try not to access bitmap directly. Since we are here, also change the parameter 'init' from int to bool. Signed-off-by: Yu Kuai <yukuai3@huawei.com> Link: https://lore.kernel.org/r/20240826074452.1490072-35-yukuai1@huaweicloud.com Signed-off-by: Song Liu <song@kernel.org>	2024-08-27 12:43:15 -07:00
Yu Kuai	18db2a9c60	md/md-bitmap: merge md_bitmap_daemon_work() into bitmap_operations So that the implementation won't be exposed, and it'll be possible to invent a new bitmap by replacing bitmap_operations. Signed-off-by: Yu Kuai <yukuai3@huawei.com> Link: https://lore.kernel.org/r/20240826074452.1490072-34-yukuai1@huaweicloud.com Signed-off-by: Song Liu <song@kernel.org>	2024-08-27 12:43:14 -07:00
Yu Kuai	3c9883e77a	md/md-bitmap: merge bitmap_unplug() into bitmap_operations So that the implementation won't be exposed, and it'll be possible to invent a new bitmap by replacing bitmap_operations. Signed-off-by: Yu Kuai <yukuai3@huawei.com> Link: https://lore.kernel.org/r/20240826074452.1490072-33-yukuai1@huaweicloud.com Signed-off-by: Song Liu <song@kernel.org>	2024-08-27 12:43:14 -07:00
Yu Kuai	48eb95810a	md/md-bitmap: merge md_bitmap_unplug_async() into md_bitmap_unplug() Add a parameter 'bool sync' to distinguish them, and md_bitmap_unplug_async() won't be exported anymore, hence bitmap_operations only need one op to cover them. Signed-off-by: Yu Kuai <yukuai3@huawei.com> Link: https://lore.kernel.org/r/20240826074452.1490072-32-yukuai1@huaweicloud.com Signed-off-by: Song Liu <song@kernel.org>	2024-08-27 12:43:14 -07:00
Yu Kuai	4338b94271	md/md-bitmap: merge md_bitmap_sync_with_cluster() into bitmap_operations So that the implementation won't be exposed, and it'll be possible to invent a new bitmap by replacing bitmap_operations. Signed-off-by: Yu Kuai <yukuai3@huawei.com> Link: https://lore.kernel.org/r/20240826074452.1490072-31-yukuai1@huaweicloud.com Signed-off-by: Song Liu <song@kernel.org>	2024-08-27 12:43:14 -07:00
Yu Kuai	15db1eca63	md/md-bitmap: merge md_bitmap_cond_end_sync() into bitmap_operations So that the implementation won't be exposed, and it'll be possible to invent a new bitmap by replacing bitmap_operations. Also change the parameter from bitmap to mddev, to avoid access bitmap outside md-bitmap.c as much as possible. Signed-off-by: Yu Kuai <yukuai3@huawei.com> Link: https://lore.kernel.org/r/20240826074452.1490072-30-yukuai1@huaweicloud.com Signed-off-by: Song Liu <song@kernel.org>	2024-08-27 12:42:42 -07:00
Yu Kuai	077b18abde	md/md-bitmap: merge md_bitmap_close_sync() into bitmap_operations So that the implementation won't be exposed, and it'll be possible to invent a new bitmap by replacing bitmap_operations. Also change the parameter from bitmap to mddev, to avoid access bitmap outside md-bitmap.c as much as possible. Signed-off-by: Yu Kuai <yukuai3@huawei.com> Link: https://lore.kernel.org/r/20240826074452.1490072-29-yukuai1@huaweicloud.com Signed-off-by: Song Liu <song@kernel.org>	2024-08-27 10:14:17 -07:00
Yu Kuai	1415f402e1	md/md-bitmap: merge md_bitmap_end_sync() into bitmap_operations So that the implementation won't be exposed, and it'll be possible to invent a new bitmap by replacing bitmap_operations. Also change the parameter from bitmap to mddev, to avoid access bitmap outside md-bitmap.c as much as possible. Signed-off-by: Yu Kuai <yukuai3@huawei.com> Link: https://lore.kernel.org/r/20240826074452.1490072-28-yukuai1@huaweicloud.com Signed-off-by: Song Liu <song@kernel.org>	2024-08-27 10:14:17 -07:00
Yu Kuai	9be669bd1b	md/md-bitmap: remove the parameter 'aborted' for md_bitmap_end_sync() For internal callers, aborted are always set to false, while for external callers, aborted are always set to true. Hence there is no need to always pass in true for exported api. Signed-off-by: Yu Kuai <yukuai3@huawei.com> Link: https://lore.kernel.org/r/20240826074452.1490072-27-yukuai1@huaweicloud.com Signed-off-by: Song Liu <song@kernel.org>	2024-08-27 10:14:17 -07:00
Yu Kuai	fe6a19d40c	md/md-bitmap: merge md_bitmap_start_sync() into bitmap_operations So that the implementation won't be exposed, and it'll be possible to invent a new bitmap by replacing bitmap_operations. Also change the parameter from bitmap to mddev, to avoid access bitmap outside md-bitmap.c as much as possible. Also fix lots of code style. Signed-off-by: Yu Kuai <yukuai3@huawei.com> Link: https://lore.kernel.org/r/20240826074452.1490072-26-yukuai1@huaweicloud.com Signed-off-by: Song Liu <song@kernel.org>	2024-08-27 10:14:17 -07:00
Yu Kuai	3486015fac	md/md-bitmap: merge md_bitmap_endwrite() into bitmap_operations So that the implementation won't be exposed, and it'll be possible to invent a new bitmap by replacing bitmap_operations. Also change the parameter from bitmap to mddev, to avoid access bitmap outside md-bitmap.c as much as possible. And change the type of 'success' and 'behind' from int to bool. Signed-off-by: Yu Kuai <yukuai3@huawei.com> Link: https://lore.kernel.org/r/20240826074452.1490072-25-yukuai1@huaweicloud.com Signed-off-by: Song Liu <song@kernel.org>	2024-08-27 10:14:17 -07:00
Yu Kuai	c2257df410	md/md-bitmap: merge md_bitmap_startwrite() into bitmap_operations So that the implementation won't be exposed, and it'll be possible to invent a new bitmap by replacing bitmap_operations. Also change the parameter from bitmap to mddev, to avoid access bitmap outside md-bitmap.c as much as possible. And change the type of 'behind' from int to bool. Signed-off-by: Yu Kuai <yukuai3@huawei.com> Link: https://lore.kernel.org/r/20240826074452.1490072-24-yukuai1@huaweicloud.com Signed-off-by: Song Liu <song@kernel.org>	2024-08-27 10:14:17 -07:00
Yu Kuai	2d3b130e17	md/md-bitmap: merge md_bitmap_dirty_bits() into bitmap_operations So that the implementation won't be exposed, and it'll be possible to invent a new bitmap by replacing bitmap_operations. Also change the parameter from bitmap to mddev, to avoid access bitmap outside md-bitmap.c as much as possible. And while we're here, also fix coding style for bitmap_store(). Signed-off-by: Yu Kuai <yukuai3@huawei.com> Link: https://lore.kernel.org/r/20240826074452.1490072-23-yukuai1@huaweicloud.com Signed-off-by: Song Liu <song@kernel.org>	2024-08-27 10:14:17 -07:00
Yu Kuai	b26313cb96	md/md-bitmap: merge bitmap_write_all() into bitmap_operations So that the implementation won't be exposed, and it'll be possible to invent a new bitmap by replacing bitmap_operations. Also change the parameter from bitmap to mddev, to avoid access bitmap outside md-bitmap.c as much as possible. Signed-off-by: Yu Kuai <yukuai3@huawei.com> Link: https://lore.kernel.org/r/20240826074452.1490072-22-yukuai1@huaweicloud.com Signed-off-by: Song Liu <song@kernel.org>	2024-08-27 10:14:16 -07:00
Yu Kuai	ea076ceb35	md/md-bitmap: remove md_bitmap_setallbits() md_bitmap_setallbits() is not used, hence can be removed. Signed-off-by: Yu Kuai <yukuai3@huawei.com> Link: https://lore.kernel.org/r/20240826074452.1490072-21-yukuai1@huaweicloud.com Signed-off-by: Song Liu <song@kernel.org>	2024-08-27 10:14:16 -07:00
Yu Kuai	696936838b	md/md-bitmap: merge md_bitmap_status() into bitmap_operations So that the implementation won't be exposed, and it'll be possible to invent a new bitmap by replacing bitmap_operations. Signed-off-by: Yu Kuai <yukuai3@huawei.com> Link: https://lore.kernel.org/r/20240826074452.1490072-20-yukuai1@huaweicloud.com Signed-off-by: Song Liu <song@kernel.org>	2024-08-27 10:14:16 -07:00
Yu Kuai	fe59b34676	md/md-bitmap: merge md_bitmap_update_sb() into bitmap_operations So that the implementation won't be exposed, and it'll be possible to invent a new bitmap by replacing bitmap_operations. Signed-off-by: Yu Kuai <yukuai3@huawei.com> Link: https://lore.kernel.org/r/20240826074452.1490072-19-yukuai1@huaweicloud.com Signed-off-by: Song Liu <song@kernel.org>	2024-08-27 10:14:16 -07:00
Yu Kuai	a0240e3ec7	md/md-bitmap: make md_bitmap_print_sb() internal md_bitmap_print_sb() is only used inside md-bitmap.c, hence make it static, also rename it to bitmap_print_sb. Signed-off-by: Yu Kuai <yukuai3@huawei.com> Link: https://lore.kernel.org/r/20240826074452.1490072-18-yukuai1@huaweicloud.com Signed-off-by: Song Liu <song@kernel.org>	2024-08-27 10:14:16 -07:00
Yu Kuai	ca925302e8	md/md-bitmap: merge md_bitmap_flush() into bitmap_operations So that the implementation won't be exposed, and it'll be possible to invent a new bitmap by replacing bitmap_operations. Signed-off-by: Yu Kuai <yukuai3@huawei.com> Link: https://lore.kernel.org/r/20240826074452.1490072-17-yukuai1@huaweicloud.com Signed-off-by: Song Liu <song@kernel.org>	2024-08-27 10:14:16 -07:00
Yu Kuai	a2bd703192	md/md-bitmap: merge md_bitmap_destroy() into bitmap_operations So that the implementation won't be exposed, and it'll be possible to invent a new bitmap by replacing bitmap_operations. Signed-off-by: Yu Kuai <yukuai3@huawei.com> Link: https://lore.kernel.org/r/20240826074452.1490072-16-yukuai1@huaweicloud.com Signed-off-by: Song Liu <song@kernel.org>	2024-08-27 10:14:16 -07:00
Yu Kuai	e1e4908059	md/md-bitmap: merge md_bitmap_load() into bitmap_operations So that the implementation won't be exposed, and it'll be possible to invent a new bitmap by replacing bitmap_operations. Signed-off-by: Yu Kuai <yukuai3@huawei.com> Link: https://lore.kernel.org/r/20240826074452.1490072-15-yukuai1@huaweicloud.com Signed-off-by: Song Liu <song@kernel.org>	2024-08-27 10:14:16 -07:00
Yu Kuai	04c80e6495	md/md-bitmap: merge md_bitmap_create() into bitmap_operations So that the implementation won't be exposed, and it'll be possible to invent a new bitmap by replacing bitmap_operations. Signed-off-by: Yu Kuai <yukuai3@huawei.com> Link: https://lore.kernel.org/r/20240826074452.1490072-14-yukuai1@huaweicloud.com Signed-off-by: Song Liu <song@kernel.org>	2024-08-27 10:14:15 -07:00
Yu Kuai	7545d385ec	md/md-bitmap: simplify md_bitmap_create() + md_bitmap_load() Other than internal api get_bitmap_from_slot(), all other places will set returned bitmap to mddev->bitmap. So move the setting of mddev->bitmap into md_bitmap_create() to simplify code. Signed-off-by: Yu Kuai <yukuai3@huawei.com> Link: https://lore.kernel.org/r/20240826074452.1490072-13-yukuai1@huaweicloud.com Signed-off-by: Song Liu <song@kernel.org>	2024-08-27 10:14:15 -07:00
Yu Kuai	7add9db6ba	md/md-bitmap: introduce struct bitmap_operations The structure is empty for now, and will be used in later patches to merge in bitmap operations, so that bitmap implementation won't be exposed. Signed-off-by: Yu Kuai <yukuai3@huawei.com> Link: https://lore.kernel.org/r/20240826074452.1490072-12-yukuai1@huaweicloud.com Signed-off-by: Song Liu <song@kernel.org>	2024-08-27 10:14:15 -07:00
Yu Kuai	27832ad3f7	md/md-bitmap: add a new helper md_bitmap_set_pages() Currently md-cluster will set bitmap->counts.pages directly, add a helper to do this to avoid dereferencing bitmap directly. Noted that after this patch bitmap is not dereferenced directly anymore and following patches will move the structure inside md-bitmap.c. Signed-off-by: Yu Kuai <yukuai3@huawei.com> Link: https://lore.kernel.org/r/20240826074452.1490072-11-yukuai1@huaweicloud.com Signed-off-by: Song Liu <song@kernel.org>	2024-08-27 10:14:15 -07:00
Yu Kuai	9e4481ce0e	md/md-cluster: use helper md_bitmap_get_stats() to get pages in resize_bitmaps() Use the existed helper instead of open coding it, avoid dereferencing bitmap directly to prepare inventing a new bitmap. Signed-off-by: Yu Kuai <yukuai3@huawei.com> Link: https://lore.kernel.org/r/20240826074452.1490072-10-yukuai1@huaweicloud.com Signed-off-by: Song Liu <song@kernel.org>	2024-08-27 10:14:15 -07:00
Yu Kuai	a0e7744a46	md/md-bitmap: add 'behind_writes' and 'behind_wait' into struct md_bitmap_stats There are no functional changes, avoid dereferencing bitmap directly to prepare inventing a new bitmap. Also fix following checkpatch warning by using wq_has_sleeper(). WARNING: waitqueue_active without comment Signed-off-by: Yu Kuai <yukuai3@huawei.com> Link: https://lore.kernel.org/r/20240826074452.1490072-9-yukuai1@huaweicloud.com Signed-off-by: Song Liu <song@kernel.org>	2024-08-27 10:14:15 -07:00
Yu Kuai	10bc2ac105	md/md-bitmap: add 'file_pages' into struct md_bitmap_stats There are no functional changes, avoid dereferencing bitmap directly to prepare inventing a new bitmap. Signed-off-by: Yu Kuai <yukuai3@huawei.com> Link: https://lore.kernel.org/r/20240826074452.1490072-8-yukuai1@huaweicloud.com Signed-off-by: Song Liu <song@kernel.org>	2024-08-27 10:14:15 -07:00
Yu Kuai	ec6bb299c7	md/md-bitmap: add 'sync_size' into struct md_bitmap_stats To avoid dereferencing bitmap directly in md-cluster to prepare inventing a new bitmap. BTW, also fix following checkpatch warnings: WARNING: Deprecated use of 'kmap_atomic', prefer 'kmap_local_page' instead WARNING: Deprecated use of 'kunmap_atomic', prefer 'kunmap_local' instead Signed-off-by: Yu Kuai <yukuai3@huawei.com> Link: https://lore.kernel.org/r/20240826074452.1490072-7-yukuai1@huaweicloud.com Signed-off-by: Song Liu <song@kernel.org>	2024-08-27 10:14:15 -07:00
Yu Kuai	82697ccf7e	md/md-cluster: fix spares warnings for __le64 drivers/md/md-cluster.c:1220:22: warning: incorrect type in assignment (different base types) drivers/md/md-cluster.c:1220:22: expected unsigned long my_sync_size drivers/md/md-cluster.c:1220:22: got restricted __le64 [usertype] sync_size drivers/md/md-cluster.c:1252:35: warning: incorrect type in assignment (different base types) drivers/md/md-cluster.c:1252:35: expected unsigned long sync_size drivers/md/md-cluster.c:1252:35: got restricted __le64 [usertype] sync_size drivers/md/md-cluster.c:1253:41: warning: restricted __le64 degrades to integer Fix the warnings by using le64_to_cpu() to convet __le64 to integer. Signed-off-by: Yu Kuai <yukuai3@huawei.com> Link: https://lore.kernel.org/r/20240826074452.1490072-6-yukuai1@huaweicloud.com Signed-off-by: Song Liu <song@kernel.org>	2024-08-27 10:14:14 -07:00
Yu Kuai	d004442f46	md/md-bitmap: add 'events_cleared' into struct md_bitmap_stats Also add a new helper to get events_cleared to avoid dereferencing bitmap directly to prepare inventing a new bitmap. Signed-off-by: Yu Kuai <yukuai3@huawei.com> Link: https://lore.kernel.org/r/20240826074452.1490072-5-yukuai1@huaweicloud.com Signed-off-by: Song Liu <song@kernel.org>	2024-08-27 10:14:14 -07:00
Yu Kuai	9681538122	md: use new helper md_bitmap_get_stats() in update_array_info() There are no functional changes, avoid dereferencing bitmap directly to prepare inventing a new bitmap. Signed-off-by: Yu Kuai <yukuai3@huawei.com> Link: https://lore.kernel.org/r/20240826074452.1490072-4-yukuai1@huaweicloud.com Signed-off-by: Song Liu <song@kernel.org>	2024-08-27 10:14:14 -07:00
Yu Kuai	38f287d7e4	md/md-bitmap: replace md_bitmap_status() with a new helper md_bitmap_get_stats() There are no functional changes, and the new helper will be used in multiple places in following patches to avoid dereferencing bitmap directly. Signed-off-by: Yu Kuai <yukuai3@huawei.com> Link: https://lore.kernel.org/r/20240826074452.1490072-3-yukuai1@huaweicloud.com Signed-off-by: Song Liu <song@kernel.org>	2024-08-27 10:14:14 -07:00
Yu Kuai	2db4fa1b7e	md/raid1: use md_bitmap_wait_behind_writes() in raid1_read_request() Use the existed helper instead of open coding it to make the code cleaner. There are no functional changes, and also avoid dereferencing bitmap directly to prepare inventing a new bitmap. Noted that this patch also export md_bitmap_wait_behind_writes(), which is necessary for now, and the exported api will be removed in following patches to convert bitmap apis into ops. Signed-off-by: Yu Kuai <yukuai3@huawei.com> Link: https://lore.kernel.org/r/20240826074452.1490072-2-yukuai1@huaweicloud.com Signed-off-by: Song Liu <song@kernel.org>	2024-08-27 10:14:14 -07:00
Yu Kuai	2d389a759d	md/raid1: Clean up local variable 'b' from raid1_read_request() The local variable will only be used onced, in the error path that read_balance() failed to find a valid rdev to read. Since now the rdev is ensured can't be removed from conf while IO is still pending, remove the local variable and dereference rdev directly. Since we're here, also remove an extra empty line, and unnecessary type conversion from sector_t(u64) to unsigned long long. Signed-off-by: Yu Kuai <yukuai3@huawei.com> Link: https://lore.kernel.org/r/20240801133008.459998-1-yukuai1@huaweicloud.com Signed-off-by: Song Liu <song@kernel.org>	2024-08-27 10:08:22 -07:00
Yu Kuai	86ad4cda79	md: Don't flush sync_work in md_write_start() Because flush sync_work may trigger mddev_suspend() if there are spares, and this should never be done in IO path because mddev_suspend() is used to wait for IO. This problem is found by code review. Fixes: `bc08041b32` ("md: suspend array in md_start_sync() if array need reconfiguration") Cc: stable@vger.kernel.org Signed-off-by: Yu Kuai <yukuai3@huawei.com> Link: https://lore.kernel.org/r/20240801124746.242558-1-yukuai1@huaweicloud.com Signed-off-by: Song Liu <song@kernel.org>	2024-08-27 09:55:16 -07:00
Yuesong Li	02c0207ecd	dm bufio: Remove NULL check of list_entry() list_entry() will never return a NULL pointer, thus remove the check. Signed-off-by: Yuesong Li <liyuesong@vivo.com> Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>	2024-08-22 18:32:55 +02:00
Ingo Franzki	4441686b24	dm-crypt: Allow to specify the integrity key size as option For the MAC based integrity operation, the integrity key size (i.e. key_mac_size) is currently set to the digest size of the used digest. For wrapped key HMAC algorithms, the key size is independent of the cryptographic key size. So there is no known size of the mac key in such cases. The desired key size can optionally be specified as argument when the dm-crypt device is configured via 'integrity_key_size:%u'. If no integrity_key_size argument is specified, the mac key size is still set to the digest size, as before. Increase version number to 1.28.0 so that support for the new argument can be detected by user space (i.e. cryptsetup). Signed-off-by: Ingo Franzki <ifranzki@linux.ibm.com> Reviewed-by: Milan Broz <gmazyland@gmail.com> Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>	2024-08-21 15:36:27 +02:00
Zhang Zekun	f3631ae11d	dm: Remove unused declaration and empty definition "dm_zone_map_bio" dm_zone_map_bio() has beed removed since commit `f211268ed1` ("dm: Use the block layer zone append emulation"), remain the declaration unused in header files. So, let's remove this unused declaration and empty definition. Signed-off-by: Zhang Zekun <zhangzekun11@huawei.com> Reviewed-by: Damien Le Moal <dlemoal@kernel.org> Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>	2024-08-21 13:12:12 +02:00
Susan LeGendre-McGhee	448c4e4eb1	dm vdo: force read-only mode for a corrupt recovery journal Ensure the recovery journal does not attempt recovery when blocks with mismatched metadata versions are detected. This check is performed after determining that the blocks are otherwise valid so that it does not interfere with normal recovery. Signed-off-by: Susan LeGendre-McGhee <slegendr@redhat.com> Signed-off-by: Matthew Sakai <msakai@redhat.com> Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>	2024-08-21 13:12:12 +02:00
Susan LeGendre-McGhee	f3ff668352	dm vdo: abort loading dirty VDO with the old recovery journal format Abort the load process with status code VDO_UNSUPPORTED_VERSION without forcing read-only mode when a journal block with the old format version is detected. Forcing the VDO volume into read-only mode and thus requiring a read-only rebuild should only be done when absolutely necessary. Signed-off-by: Susan LeGendre-McGhee <slegendr@redhat.com> Signed-off-by: Matthew Sakai <msakai@redhat.com> Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>	2024-08-21 13:11:34 +02:00
Bruce Johnston	47874c98dc	dm vdo: add dmsetup message for returning configuration info Add a new dmsetup message called config, which will return useful configuration information for the vdo volume and the uds index associated with it. The output is a YAML string, and contains a version number to allow future additions to the content. Signed-off-by: Bruce Johnston <bjohnsto@redhat.com> Signed-off-by: Matthew Sakai <msakai@redhat.com> Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>	2024-08-21 13:05:56 +02:00
Ken Raeburn	3a59b2ec24	dm vdo: remove bad check of bi_next field Remove this check to prevent spurious warning messages due to the behavior of other storage layers. This check was intended to make sure dm-vdo does not chain metadata bios together. However, vdo has no control over underlying storage layers, so the assertion is not always true. Signed-off-by: Ken Raeburn <raeburn@redhat.com> Signed-off-by: Matthew Sakai <msakai@redhat.com> Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>	2024-08-21 13:05:56 +02:00
Ken Raeburn	0808ebf2f8	dm vdo: don't refer to dedupe_context after releasing it Clear the dedupe_context pointer in a data_vio whenever ownership of the context is lost, so that vdo can't examine it accidentally. Signed-off-by: Ken Raeburn <raeburn@redhat.com> Signed-off-by: Matthew Sakai <msakai@redhat.com> Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>	2024-08-21 13:05:56 +02:00
Deven Bowers	a6af7bc3d7	dm-verity: expose root hash digest and signature data to LSMs dm-verity provides a strong guarantee of a block device's integrity. As a generic way to check the integrity of a block device, it provides those integrity guarantees to its higher layers, including the filesystem level. However, critical security metadata like the dm-verity roothash and its signing information are not easily accessible to the LSMs. To address this limitation, this patch introduces a mechanism to store and manage these essential security details within a newly added LSM blob in the block_device structure. This addition allows LSMs to make access control decisions on the integrity data stored within the block_device, enabling more flexible security policies. For instance, LSMs can now revoke access to dm-verity devices based on their roothashes, ensuring that only authorized and verified content is accessible. Additionally, LSMs can enforce policies to only allow files from dm-verity devices that have a valid digital signature to execute, effectively blocking any unsigned files from execution, thus enhancing security against unauthorized modifications. The patch includes new hook calls, `security_bdev_setintegrity()`, in dm-verity to expose the dm-verity roothash and the roothash signature to LSMs via preresume() callback. By using the preresume() callback, it ensures that the security metadata is consistently in sync with the metadata of the dm-verity target in the current active mapping table. The hook calls are depended on CONFIG_SECURITY. Signed-off-by: Deven Bowers <deven.desai@linux.microsoft.com> Signed-off-by: Fan Wu <wufan@linux.microsoft.com> Reviewed-by: Mikulas Patocka <mpatocka@redhat.com> [PM: moved sig_size field as discussed] Signed-off-by: Paul Moore <paul@paul-moore.com>	2024-08-20 14:02:38 -04:00
Linus Torvalds	85652baa89	block-6.11-20240824 -----BEGIN PGP SIGNATURE----- iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAma/oVEQHGF4Ym9lQGtl cm5lbC5kawAKCRD301j7KXHgpg1AD/wJFx5MaHVNGtgtl6av6G8oJKTnFiaPGilY 1+oQGd0JwokWhXNzs9AqhSiusgvS3cYj5hU/cvecI4d90bZ0Rl/w1h01zPmivWYt ndB3PjHKcGD4JM+gdSDbWOSG00b7jymCW977A5Kw5DYR9CNzCJNs1vVed6FDKVr3 Hke0YsXu37a2jEb63ofqR+DesmZyBlucKGqQ8J5W+PYB1PR21KV7COsKwOvp8M3x asMq2QWLVibQ2BdrINqOzdHmQMYeaG51K7mjtZENErOxxDKEkeMdcuTM4SMjyQtQ mF3YsmxfhWtqakrMqoHc+W3k8kZ7qd/qqDPOlrDLRgoPs+uGVDr7kuyyTDQJZWPK sNlPI2RtRfsPfcTPuvZQbn2ev/UoiJbjGlKiqeabg0FllsnjGz+m3MsufVGA3a2v OoXgQQVvSKdJW9Z2Nq/TBFunsnB60q5QwRUXzl/ozttPanPsw7mKTxHuzXlFFGlj s5IL63GYxWE6xSh0TElF9cMxAiaRAmNuAeK9279ULPj69UNWlnJKw8wyVmBgmTXy kYn5AtHKsdj+kJFGinWQ9U0os5PZxgP2ikPOjFVigReOLY7noU4OWfcK/SgUtXMk wo0y4ROmm+/bKtOBnG+SnBkpwjOUI4kfVRXfOYQAl5IOxFIHUwFIhxGcNjTvSW4P KgY/8TQvog== =lbw7 -----END PGP SIGNATURE----- Merge tag 'block-6.11-20240824' of git://git.kernel.dk/linux Pull block fixes from Jens Axboe: - Fix corruption issues with s390/dasd (Eric, Stefan) - Fix a misuse of non irq locking grab of a lock (Li) - MD pull request with a single data corruption fix for raid1 (Yu) * tag 'block-6.11-20240824' of git://git.kernel.dk/linux: block: Fix lockdep warning in blk_mq_mark_tag_wait md/raid1: Fix data corruption for degraded array with slow disk s390/dasd: fix error recovery leading to data corruption on ESE devices s390/dasd: Remove DMA alignment	2024-08-16 14:03:31 -07:00
Yu Kuai	c916ca3530	md/raid1: Fix data corruption for degraded array with slow disk read_balance() will avoid reading from slow disks as much as possible, however, if valid data only lands in slow disks, and a new normal disk is still in recovery, unrecovered data can be read: raid1_read_request read_balance raid1_should_read_first -> return false choose_best_rdev -> normal disk is not recovered, return -1 choose_bb_rdev -> missing the checking of recovery, return the normal disk -> read unrecovered data Root cause is that the checking of recovery is missing in choose_bb_rdev(). Hence add such checking to fix the problem. Also fix similar problem in choose_slow_rdev(). Cc: stable@vger.kernel.org Fixes: `9f3ced7922` ("md/raid1: factor out choose_bb_rdev() from read_balance()") Fixes: `dfa8ecd167` ("md/raid1: factor out choose_slow_rdev() from read_balance()") Reported-and-tested-by: Mateusz Jończyk <mat.jonczyk@o2.pl> Closes: https://lore.kernel.org/all/9952f532-2554-44bf-b906-4880b2e88e3a@o2.pl/ Signed-off-by: Yu Kuai <yukuai3@huawei.com> Link: https://lore.kernel.org/r/20240803091137.3197008-1-yukuai1@huaweicloud.com Signed-off-by: Song Liu <song@kernel.org>	2024-08-15 13:38:17 -07:00
Chen Ni	ca958879ad	md: convert comma to semicolon Replace a comma between expression statements by a semicolon. Fixes: `5e5702898e` ("md/raid10: Handle read errors during recovery better.") Signed-off-by: Chen Ni <nichen@iscas.ac.cn> Link: https://lore.kernel.org/r/20240716025852.400259-1-nichen@iscas.ac.cn Signed-off-by: Song Liu <song@kernel.org>	2024-08-15 00:06:50 -07:00
Mikulas Patocka	faada2174c	dm persistent data: fix memory allocation failure kmalloc is unreliable when allocating more than 8 pages of memory. It may fail when there is plenty of free memory but the memory is fragmented. Zdenek Kabelac observed such failure in his tests. This commit changes kmalloc to kvmalloc - kvmalloc will fall back to vmalloc if the large allocation fails. Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Reported-by: Zdenek Kabelac <zkabelac@redhat.com> Reviewed-by: Mike Snitzer <snitzer@kernel.org> Cc: stable@vger.kernel.org	2024-08-13 21:14:21 +02:00
Khazhismel Kumykov	7a636b4f03	dm resume: don't return EINVAL when signalled If the dm_resume method is called on a device that is not suspended, the method will suspend the device briefly, before resuming it (so that the table will be swapped). However, there was a bug that the return value of dm_suspended_md was not checked. dm_suspended_md may return an error when it is interrupted by a signal. In this case, do_resume would call dm_swap_table, which would return -EINVAL. This commit fixes the logic, so that error returned by dm_suspend is checked and the resume operation is undone. Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Khazhismel Kumykov <khazhy@google.com> Cc: stable@vger.kernel.org	2024-08-13 13:51:34 +02:00
Mikulas Patocka	1e1fd567d3	dm suspend: return -ERESTARTSYS instead of -EINTR This commit changes device mapper, so that it returns -ERESTARTSYS instead of -EINTR when it is interrupted by a signal (so that the ioctl can be restarted). The manpage signal(7) says that the ioctl function should be restarted if the signal was handled with SA_RESTART. Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Cc: stable@vger.kernel.org	2024-08-13 13:50:45 +02:00
Linus Torvalds	4477b39c32	minmax: add a few more MIN_T/MAX_T users Commit `3a7e02c040` ("minmax: avoid overly complicated constant expressions in VM code") added the simpler MIN_T/MAX_T macros in order to avoid some excessive expansion from the rather complicated regular min/max macros. The complexity of those macros stems from two issues: (a) trying to use them in situations that require a C constant expression (in static initializers and for array sizes) (b) the type sanity checking and MIN_T/MAX_T avoids both of these issues. Now, in the whole (long) discussion about all this, it was pointed out that the whole type sanity checking is entirely unnecessary for min_t/max_t which get a fixed type that the comparison is done in. But that still leaves min_t/max_t unnecessarily complicated due to worries about the C constant expression case. However, it turns out that there really aren't very many cases that use min_t/max_t for this, and we can just force-convert those. This does exactly that. Which in turn will then allow for much simpler implementations of min_t()/max_t(). All the usual "macros in all upper case will evaluate the arguments multiple times" rules apply. We should do all the same things for the regular min/max() vs MIN/MAX() cases, but that has the added complexity of various drivers defining their own local versions of MIN/MAX, so that needs another level of fixes first. Link: https://lore.kernel.org/all/b47fad1d0cf8449886ad148f8c013dae@AcuMS.aculab.com/ Cc: David Laight <David.Laight@aculab.com> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2024-07-28 13:41:14 -07:00
Linus Torvalds	7d080fa867	for-6.11/block-20240722 -----BEGIN PGP SIGNATURE----- iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmaeZBIQHGF4Ym9lQGtl cm5lbC5kawAKCRD301j7KXHgpqI7D/9XPinZuuwiZ/670P8yjk1SHFzqzdwtuFuP +Dq2lcoYRkuwm5PvCvhs3QH2mnjS1vo1SIoAijGEy3V1bs41mw87T2knKMIn4g5v I5A4gC6i0IqxIkFm17Zx9yG+MivoOmPtqM4RMxze2xS/uJwWcvg4tjBHZfylY3d9 oaIXyZj+0dTRf955K2x/5dpfE6qjtDG0bqrrJXnzaIKHBJk2HKezYFbTstAA4OY+ MvMqRL7uJmJBd7384/WColIO0b8/UEchPl7qG+zy9pg+wzQGLFyF/Z/KdjrWdDMD IFs92uNDFQmiGoyujJmXdDV9xpKi94nqDAtUR+Qct0Mui5zz0w2RNcGvyTDjBMpv CAzTkTW48moYkwLPhPmy8Ge69elT82AC/9ZQAHbA7g3TYgJML5IT/7TtiaVe6Rc1 podnTR3/e9XmZnc25aUZeAr6CG7b+0NBvB+XPO9lNyMEE38sfwShoPdAGdKX25oA mjnLHBc9grVOQzRGEx22E11k+1ChXf/o9H546PB2Pr9yvf/DQ3868a+QhHssxufL Xul1K5a+pUmOnaTLD3ESftYlFmcDOHQ6gDK697so7mU7lrD3ctN4HYZ2vwNk35YY 2b4xrABrOEbAXlUo3Ht8F/ecg6qw4xTr9vAW5q4+L2H5+28RaZKYclHhLmR23yfP xJ/d5FfVFQ== =fqoV -----END PGP SIGNATURE----- Merge tag 'for-6.11/block-20240722' of git://git.kernel.dk/linux Pull more block updates from Jens Axboe: - MD fixes via Song: - md-cluster fixes (Heming Zhao) - raid1 fix (Mateusz Jończyk) - s390/dasd module description (Jeff) - Series cleaning up and hardening the blk-mq debugfs flag handling (John, Christoph) - blk-cgroup cleanup (Xiu) - Error polled IO attempts if backend doesn't support it (hexue) - Fix for an sbitmap hang (Yang) * tag 'for-6.11/block-20240722' of git://git.kernel.dk/linux: (23 commits) blk-cgroup: move congestion_count to struct blkcg sbitmap: fix io hung due to race on sbitmap_word::cleared block: avoid polling configuration errors block: Catch possible entries missing from rqf_name[] block: Simplify definition of RQF_NAME() block: Use enum to define RQF_x bit indexes block: Catch possible entries missing from cmd_flag_name[] block: Catch possible entries missing from alloc_policy_name[] block: Catch possible entries missing from hctx_flag_name[] block: Catch possible entries missing from hctx_state_name[] block: Catch possible entries missing from blk_queue_flag_name[] block: Make QUEUE_FLAG_x as an enum block: Relocate BLK_MQ_MAX_DEPTH block: Relocate BLK_MQ_CPU_WORK_BATCH block: remove QUEUE_FLAG_STOPPED block: Add missing entry to hctx_flag_name[] block: Add zone write plugging entry to rqf_name[] block: Add missing entries from cmd_flag_name[] s390/dasd: fix error checks in dasd_copy_pair_store() s390/dasd: add missing MODULE_DESCRIPTION() macros ...	2024-07-22 11:32:05 -07:00
Linus Torvalds	0256994887	for-6.11/block-post-20240722 -----BEGIN PGP SIGNATURE----- iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmaeY00QHGF4Ym9lQGtl cm5lbC5kawAKCRD301j7KXHgpjPGD/9CPo93+V/ztfzY1J18KhA2CCUh1uuxZIjx dLfi07Bo+gyLwB1vaSf0bNy9gM8SzGFSMszSIDTErNq9/F6RvWjXN0CchyQf1Wii o2UyQg8JLjT2o1pJSsdJySZQRsG/daWUHzHaX1kD343Cd6OBV2YaVFdYTaXUGg4v G1AVh7qFvQhAIg1jV8q2z7QC7PSeuTnvyvY65Z8/iVJe95FayOrtGmDPTaJab8r2 7uEFiWZk23erzNygVdcSoNIrwWFmRARz5o3IvwJJfEL08hkdoAqu6vD2oCUZspKU 3g4wU6JrN0QYQpVwIJ9WcwYcoOm6iMm9xwCVMsp8R3KRUU107HjaiEazFDGk4HW4 ozZTa7leTXnrRqnjVhcQpUvC+1uVLCFN8sSElNY7m2dg0IojnlMz+t3lMiTtaR9N Rt6wy5alVQFlb2uhzALuUh6HM1zA98swWySNoP0arTkOT9kjXwwAgn0I+M1s9Uxo FaQvM0YnAsb2C8LSpNtZWLaTlRSLTzUsGThLSJMBZueIJ9+BF23i7W7euklCNxjj Jl6CykEkEkacOxU6b9PG6qSnUq9JJ+W7gcJVing+ugAFrZDutxy6eJZXVv8wuvCC EOxaADpSs2xAaH9V0BMmwO51w0NDWySyGPHB5UBkhNjqOji/oG3FvAITiboQArgS FES4jtU1TA== =dn4l -----END PGP SIGNATURE----- Merge tag 'for-6.11/block-post-20240722' of git://git.kernel.dk/linux Pull block integrity mapping updates from Jens Axboe: "A set of cleanups and fixes for the block integrity support. Sent separately from the main block changes from last week, as they depended on later fixes in the 6.10-rc cycle" * tag 'for-6.11/block-post-20240722' of git://git.kernel.dk/linux: block: don't free the integrity payload in bio_integrity_unmap_free_user block: don't free submitter owned integrity payload on I/O completion block: call bio_integrity_unmap_free_user from blk_rq_unmap_user block: don't call bio_uninit from bio_endio block: also return bio_integrity_payload * from stubs block: split integrity support out of bio.h	2024-07-22 11:04:09 -07:00
Linus Torvalds	527eff227d	- In the series "treewide: Refactor heap related implementation", Kuan-Wei Chiu has significantly reworked the min_heap library code and has taught bcachefs to use the new more generic implementation. - Yury Norov's series "Cleanup cpumask.h inclusion in core headers" reworks the cpumask and nodemask headers to make things generally more rational. - Kuan-Wei Chiu has sent along some maintenance work against our sorting library code in the series "lib/sort: Optimizations and cleanups". - More library maintainance work from Christophe Jaillet in the series "Remove usage of the deprecated ida_simple_xx() API". - Ryusuke Konishi continues with the nilfs2 fixes and clanups in the series "nilfs2: eliminate the call to inode_attach_wb()". - Kuan-Ying Lee has some fixes to the gdb scripts in the series "Fix GDB command error". - Plus the usual shower of singleton patches all over the place. Please see the relevant changelogs for details. -----BEGIN PGP SIGNATURE----- iHUEABYKAB0WIQTTMBEPP41GrTpTJgfdBJ7gKXxAjgUCZp2GvwAKCRDdBJ7gKXxA jlf/AP48xP5ilIHbtpAKm2z+MvGuTxJQ5VSC0UXFacuCbc93lAEA+Yo+vOVRmh6j fQF2nVKyKLYfSz7yqmCyAaHWohIYLgg= =Stxz -----END PGP SIGNATURE----- Merge tag 'mm-nonmm-stable-2024-07-21-15-07' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm Pull non-MM updates from Andrew Morton: - In the series "treewide: Refactor heap related implementation", Kuan-Wei Chiu has significantly reworked the min_heap library code and has taught bcachefs to use the new more generic implementation. - Yury Norov's series "Cleanup cpumask.h inclusion in core headers" reworks the cpumask and nodemask headers to make things generally more rational. - Kuan-Wei Chiu has sent along some maintenance work against our sorting library code in the series "lib/sort: Optimizations and cleanups". - More library maintainance work from Christophe Jaillet in the series "Remove usage of the deprecated ida_simple_xx() API". - Ryusuke Konishi continues with the nilfs2 fixes and clanups in the series "nilfs2: eliminate the call to inode_attach_wb()". - Kuan-Ying Lee has some fixes to the gdb scripts in the series "Fix GDB command error". - Plus the usual shower of singleton patches all over the place. Please see the relevant changelogs for details. * tag 'mm-nonmm-stable-2024-07-21-15-07' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (98 commits) ia64: scrub ia64 from poison.h watchdog/perf: properly initialize the turbo mode timestamp and rearm counter tsacct: replace strncpy() with strscpy() lib/bch.c: use swap() to improve code test_bpf: convert comma to semicolon init/modpost: conditionally check section mismatch to __meminit* init: remove unused __MEMINIT* macros nilfs2: Constify struct kobj_type nilfs2: avoid undefined behavior in nilfs_cnt32_ge macro math: rational: add missing MODULE_DESCRIPTION() macro lib/zlib: add missing MODULE_DESCRIPTION() macro fs: ufs: add MODULE_DESCRIPTION() lib/rbtree.c: fix the example typo ocfs2: add bounds checking to ocfs2_check_dir_entry() fs: add kernel-doc comments to ocfs2_prepare_orphan_dir() coredump: simplify zap_process() selftests/fpu: add missing MODULE_DESCRIPTION() macro compiler.h: simplify data_race() macro build-id: require program headers to be right after ELF header resource: add missing MODULE_DESCRIPTION() ...	2024-07-21 17:56:22 -07:00
Linus Torvalds	661fb4e68c	- Optimize processing of flush bios in the dm-linear and dm-stripe targets - Dm-io cleansups and refactoring - Remove unused 'struct thunk' in dm-cache - Handle minor device numbers > 255 in dm-init - Dm-verity refactoring & enabling platform keyring - Fix warning in dm-raid - Improve dm-crypt performance - split bios to smaller pieces, so that They could be processed concurrently - Stop using blk_limits_io_{min,opt} - Dm-vdo cleanup and refactoring - Remove max_write_zeroes_granularity and max_secure_erase_granularity - Dm-multipath cleanup & refactoring - Add dm-crypt and dm-integrity support for non-power-of-2 sector size - Fix reshape in dm-raid - Make dm_block_validator const -----BEGIN PGP SIGNATURE----- iIoEABYIADIWIQRnH8MwLyZDhyYfesYTAyx9YGnhbQUCZpo+9xQcbXBhdG9ja2FA cmVkaGF0LmNvbQAKCRATAyx9YGnhbYKDAQCZP2pJyh9tRZ8GsHtk3l/ZMftmk1/c 26v6vYlOTObJHAEA3TH2ahVnzhqYs/x3zEW/n91feTSeUJrrJ9DqHxWt+Ac= =S3yx -----END PGP SIGNATURE----- Merge tag 'for-6.11/dm-changes' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm Pull device mapper updates from Mikulas Patocka: - Optimize processing of flush bios in the dm-linear and dm-stripe targets - Dm-io cleansups and refactoring - Remove unused 'struct thunk' in dm-cache - Handle minor device numbers > 255 in dm-init - Dm-verity refactoring & enabling platform keyring - Fix warning in dm-raid - Improve dm-crypt performance - split bios to smaller pieces, so that They could be processed concurrently - Stop using blk_limits_io_{min,opt} - Dm-vdo cleanup and refactoring - Remove max_write_zeroes_granularity and max_secure_erase_granularity - Dm-multipath cleanup & refactoring - Add dm-crypt and dm-integrity support for non-power-of-2 sector size - Fix reshape in dm-raid - Make dm_block_validator const * tag 'for-6.11/dm-changes' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm: (33 commits) dm vdo: fix a minor formatting issue in vdo.rst dm vdo int-map: fix kerneldoc formatting dm vdo repair: add missing kerneldoc fields dm: Constify struct dm_block_validator dm-integrity: introduce the Inline mode dm: introduce the target flag mempool_needs_integrity dm raid: fix stripes adding reshape size issues dm raid: move _get_reshape_sectors() as prerequisite to fixing reshape size issues dm-crypt: support for per-sector NVMe metadata dm mpath: don't call dm_get_device in multipath_message dm: factor out helper function from dm_get_device dm-verity: fix dm_is_verity_target() when dm-verity is builtin dm: Remove max_secure_erase_granularity dm: Remove max_write_zeroes_granularity dm vdo indexer: use swap() instead of open coding it dm vdo: remove unused struct 'uds_attribute' dm: stop using blk_limits_io_{min,opt} dm-crypt: limit the size of encryption requests dm verity: add support for signature verification with platform keyring dm-raid: Fix WARN_ON_ONCE check for sync_thread in raid_resume ...	2024-07-19 10:48:44 -07:00
Matthew Sakai	513789b7fb	dm vdo int-map: fix kerneldoc formatting Reported-by: kernel test robot <lkp@intel.com> Closes: https://lore.kernel.org/oe-kbuild-all/202407141607.M3E2XQ0Z-lkp@intel.com/ Signed-off-by: Matthew Sakai <msakai@redhat.com> Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>	2024-07-19 12:08:21 +02:00
Matthew Sakai	fa398e603f	dm vdo repair: add missing kerneldoc fields Also remove trivial comment for increment_recovery_point. Reported-by: Abaci Robot <abaci@linux.alibaba.com> Closes: https://bugzilla.openanolis.cn/show_bug.cgi?id=9518 Signed-off-by: Matthew Sakai <msakai@redhat.com> Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>	2024-07-19 12:08:21 +02:00
Christophe JAILLET	0b60be1628	dm: Constify struct dm_block_validator 'struct dm_block_validator' are not modified in these drivers. Constifying this structure moves some data to a read-only section, so increase overall security. On a x86_64, with allmodconfig, as an example: Before: ====== text data bss dec hex filename 32047 920 16 32983 80d7 drivers/md/dm-cache-metadata.o After: ===== text data bss dec hex filename 32075 896 16 32987 80db drivers/md/dm-cache-metadata.o Signed-off-by: Christophe JAILLET <christophe.jaillet@wanadoo.fr> Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>	2024-07-19 12:08:15 +02:00
Mikulas Patocka	fb0987682c	dm-integrity: introduce the Inline mode This commit introduces a new 'I' mode for dm-integrity. The 'I' mode may be selected if the underlying device has non-power-of-2 sector size. In this mode, dm-integrity will store integrity data directly in device's sectors and it will not use journal. This mode improves performance and reduces flash wear because there would be no journal writes. Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Mike Snitzer <snitzer@kernel.org>	2024-07-19 12:06:44 +02:00
Linus Torvalds	f097ef0e76	dlm for 6.11 - New flag DLM_LSFL_SOFTIRQ_SAFE can be set by code using dlm to indicate callbacks can be run from softirq. - Change md-cluster to set DLM_LSFL_SOFTIRQ_SAFE. - Clean up for previous changes, e.g. unused code and parameters. - Remove custom pre-allocation of rsb structs which is unnecessary with kmem caches. - Change idr to xarray for lkb structs in use. - Change idr to xarray for rsb structs being recovered. - Change outdated naming related to internal rsb states. - Fix some incorrect add/remove of rsb on scan list. - Use rcu to free rsb structs. -----BEGIN PGP SIGNATURE----- iQIzBAABCAAdFiEEcGkeEvkvjdvlR90nOBtzx/yAaaoFAmaVUlgACgkQOBtzx/yA aapCfQ//eqs19no6+TUagkzboIGxGbrPEqmJNj4Vu1sCSH3tVC4IrkI2IqqPJL9N tYHUQvp3BYOdenBZzw6tmbs6cvoA7Fps7YMqqkEKYfBCHcV9KtejqvwBdJfqiN6A RniImAph0qvvI6GK4Y+6nDyxU2n8enOhgnZMRDUS/KYV8frc70SxreqzPSkPMWLh ZnDgTIF4zahUBFEkILlXYArbbRk5FKL+SMkSDZyDd78bVnjX24KgtOt7HpDX9X70 /9DrDz3uI+XShXzpIint4Ee4ghZr1lM9g9LXDazuY62SBDknhGTzY0BYVxZ2U3NG ocUh2KbJoP29sncNxLf9Nev5JPc+Wx3iCTEgLKkOEc4Yf0jAZg+1xbopWDT+qjRV djsgTCQ1gjpHgQxrlUUo7N5ilo5ocgSXSHGJ8b886tG5eZaxiN1y3TB4T4JtO+FH Q4IkFJiaYDL44xYR85wpfOcct/5mR7kPvhuxouexKobO+lKXaUONP9Wj7pRgG/M5 qhrWY4EU8jcO/nPunPxvhJdL68T3WoHDN42tWt/7kYQqY2svvfmr6NEImde6GxqX PB3hW20cvD4wULumLM+h0rQacIWuuMQ5ahIX9og6jM7Yx/ucks1pgnRo0M0R1aUc OopoTAekSdRtgbRXr5IQPRxpKB6BFUp3Va/Yo+2g0fi5QywcVZc= =dDCi -----END PGP SIGNATURE----- Merge tag 'dlm-6.11' of git://git.kernel.org/pub/scm/linux/kernel/git/teigland/linux-dlm Pull dlm updates from David Teigland: - New flag DLM_LSFL_SOFTIRQ_SAFE can be set by code using dlm to indicate callbacks can be run from softirq - Change md-cluster to set DLM_LSFL_SOFTIRQ_SAFE - Clean up for previous changes, e.g. unused code and parameters - Remove custom pre-allocation of rsb structs which is unnecessary with kmem caches - Change idr to xarray for lkb structs in use - Change idr to xarray for rsb structs being recovered - Change outdated naming related to internal rsb states - Fix some incorrect add/remove of rsb on scan list - Use rcu to free rsb structs * tag 'dlm-6.11' of git://git.kernel.org/pub/scm/linux/kernel/git/teigland/linux-dlm: dlm: add rcu_barrier before destroy kmem cache dlm: remove DLM_LSFL_SOFTIRQ from exflags fs: dlm: remove unused struct 'dlm_processed_nodes' md-cluster: use DLM_LSFL_SOFTIRQ for dlm_new_lockspace() dlm: implement LSFL_SOFTIRQ_SAFE dlm: introduce DLM_LSFL_SOFTIRQ_SAFE dlm: use LSFL_FS to check for kernel lockspace dlm: use rcu to avoid an extra rsb struct lookup dlm: fix add_scan and del_scan usage dlm: change list and timer names dlm: move recover idr to xarray datastructure dlm: move lkb idr to xarray datastructure dlm: drop own rsb pre allocation mechanism dlm: remove ls_local_handle from struct dlm_ls dlm: remove unused parameter in dlm_midcomms_addr dlm: don't kref_init rsbs created for toss list dlm: remove scand leftovers	2024-07-17 12:16:22 -07:00
Linus Torvalds	3e78198862	for-6.11/block-20240710 -----BEGIN PGP SIGNATURE----- iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmaOTd8QHGF4Ym9lQGtl cm5lbC5kawAKCRD301j7KXHgppqIEACUr8Vv2FtezvT3OfVSlYWHHLXzkRhwEG5s vdk0o7Ow6U54sMjfymbHTgLD0ZOJf3uJ6BI95FQuW41jPzDFVbx4Hy8QzqonMkw9 1D/YQ4zrVL2mOKBzATbKpoGJzMOzGeoXEueFZ1AYPAX7RrDtP4xPQNfrcfkdE2zF LycJN70Vp6lrZZMuI9yb9ts1tf7TFzK0HJANxOAKTgSiPmBmxesjkJlhrdUrgkAU qDVyjj7u/ssndBJAb9i6Bl95Do8s9t4DeJq5/6wgKqtf5hClMXzPVB8Wy084gr6E rTRsCEhOug3qEZSqfAgAxnd3XFRNc/p2KMUe5YZ4mAqux4hpSmIQQDM/5X5K9vEv f4MNqUGlqyqntZx+KPyFpf7kLHFYS1qK4ub0FojWJEY4GrbBPNjjncLJ9+ozR0c8 kNDaFjMNAjalBee1FxNNH8LdVcd28rrCkPxRLEfO/gvBMUmvJf4ZyKmSED0v5DhY vZqKlBqG+wg0EXvdiWEHMDh9Y+q/2XBIkS6NN/Bhh61HNu+XzC838ts1X7lR+4o2 AM5Vapw+v0q6kFBMRP3IcJI/c0UcIU8EQU7axMyzWtvhog8kx8x01hIj1L4UyYYr rUdWrkugBVXJbywFuH/QIJxWxS/z4JdSw5VjASJLIrXy+aANmmG9Wonv95eyhpUv 5iv+EdRSNA== =wVi8 -----END PGP SIGNATURE----- Merge tag 'for-6.11/block-20240710' of git://git.kernel.dk/linux Pull block updates from Jens Axboe: - NVMe updates via Keith: - Device initialization memory leak fixes (Keith) - More constants defined (Weiwen) - Target debugfs support (Hannes) - PCIe subsystem reset enhancements (Keith) - Queue-depth multipath policy (Redhat and PureStorage) - Implement get_unique_id (Christoph) - Authentication error fixes (Gaosheng) - MD updates via Song - sync_action fix and refactoring (Yu Kuai) - Various small fixes (Christoph Hellwig, Li Nan, and Ofir Gal, Yu Kuai, Benjamin Marzinski, Christophe JAILLET, Yang Li) - Fix loop detach/open race (Gulam) - Fix lower control limit for blk-throttle (Yu) - Add module descriptions to various drivers (Jeff) - Add support for atomic writes for block devices, and statx reporting for same. Includes SCSI and NVMe (John, Prasad, Alan) - Add IO priority information to block trace points (Dongliang) - Various zone improvements and tweaks (Damien) - mq-deadline tag reservation improvements (Bart) - Ignore direct reclaim swap writes in writeback throttling (Baokun) - Block integrity improvements and fixes (Anuj) - Add basic support for rust based block drivers. Has a dummy null_blk variant for now (Andreas) - Series converting driver settings to queue limits, and cleanups and fixes related to that (Christoph) - Cleanup for poking too deeply into the bvec internals, in preparation for DMA mapping API changes (Christoph) - Various minor tweaks and fixes (Jiapeng, John, Kanchan, Mikulas, Ming, Zhu, Damien, Christophe, Chaitanya) * tag 'for-6.11/block-20240710' of git://git.kernel.dk/linux: (206 commits) floppy: add missing MODULE_DESCRIPTION() macro loop: add missing MODULE_DESCRIPTION() macro ublk_drv: add missing MODULE_DESCRIPTION() macro xen/blkback: add missing MODULE_DESCRIPTION() macro block/rnbd: Constify struct kobj_type block: take offset into account in blk_bvec_map_sg again block: fix get_max_segment_size() warning loop: Don't bother validating blocksize virtio_blk: Don't bother validating blocksize null_blk: Don't bother validating blocksize block: Validate logical block size in blk_validate_limits() virtio_blk: Fix default logical block size fallback nvmet-auth: fix nvmet_auth hash error handling nvme: implement ->get_unique_id block: pass a phys_addr_t to get_max_segment_size block: add a bvec_phys helper blk-lib: check for kill signal in ioctl BLKZEROOUT block: limit the Write Zeroes to manually writing zeroes fallback block: refacto blkdev_issue_zeroout block: move read-only and supported checks into (__)blkdev_issue_zeroout ...	2024-07-15 14:20:22 -07:00
Mikulas Patocka	617069741d	dm: introduce the target flag mempool_needs_integrity This commit introduces the dm target flag mempool_needs_integrity. When the flag is set, device mapper will call bioset_integrity_create on it's bio sets. The target can then call bio_integrity_alloc on the bios allocated from the table's mempool. Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Mike Snitzer <snitzer@kernel.org>	2024-07-12 12:39:06 -04:00
Mateusz Jończyk	36a5c03f23	md/raid1: set max_sectors during early return from choose_slow_rdev() Linux 6.9+ is unable to start a degraded RAID1 array with one drive, when that drive has a write-mostly flag set. During such an attempt, the following assertion in bio_split() is hit: BUG_ON(sectors <= 0); Call Trace: ? bio_split+0x96/0xb0 ? exc_invalid_op+0x53/0x70 ? bio_split+0x96/0xb0 ? asm_exc_invalid_op+0x1b/0x20 ? bio_split+0x96/0xb0 ? raid1_read_request+0x890/0xd20 ? __call_rcu_common.constprop.0+0x97/0x260 raid1_make_request+0x81/0xce0 ? __get_random_u32_below+0x17/0x70 ? new_slab+0x2b3/0x580 md_handle_request+0x77/0x210 md_submit_bio+0x62/0xa0 __submit_bio+0x17b/0x230 submit_bio_noacct_nocheck+0x18e/0x3c0 submit_bio_noacct+0x244/0x670 After investigation, it turned out that choose_slow_rdev() does not set the value of max_sectors in some cases and because of it, raid1_read_request calls bio_split with sectors == 0. Fix it by filling in this variable. This bug was introduced in commit `dfa8ecd167` ("md/raid1: factor out choose_slow_rdev() from read_balance()") but apparently hidden until commit `0091c5a269` ("md/raid1: factor out helpers to choose the best rdev from read_balance()") shortly thereafter. Cc: stable@vger.kernel.org # 6.9.x+ Signed-off-by: Mateusz Jończyk <mat.jonczyk@o2.pl> Fixes: `dfa8ecd167` ("md/raid1: factor out choose_slow_rdev() from read_balance()") Cc: Song Liu <song@kernel.org> Cc: Yu Kuai <yukuai3@huawei.com> Cc: Paul Luse <paul.e.luse@linux.intel.com> Cc: Xiao Ni <xni@redhat.com> Cc: Mariusz Tkaczyk <mariusz.tkaczyk@linux.intel.com> Link: https://lore.kernel.org/linux-raid/20240706143038.7253-1-mat.jonczyk@o2.pl/ -- Tested on both Linux 6.10 and 6.9.8. Inside a VM, mdadm testsuite for RAID1 on 6.10 did not find any problems: ./test --dev=loop --no-error --raidtype=raid1 (on 6.9.8 there was one failure, caused by external bitmap support not compiled in). Notes: - I was reliably getting deadlocks when adding / removing devices on such an array - while the array was loaded with fsstress with 20 concurrent processes. When the array was idle or loaded with fsstress with 8 processes, no such deadlocks happened in my tests. This occurred also on unpatched Linux 6.8.0 though, but not on 6.1.97-rc1, so this is likely an independent regression (to be investigated). - I was also getting deadlocks when adding / removing the bitmap on the array in similar conditions - this happened on Linux 6.1.97-rc1 also though. fsstress with 8 concurrent processes did cause it only once during many tests. - in my testing, there was once a problem with hot adding an internal bitmap to the array: mdadm: Cannot add bitmap while array is resyncing or reshaping etc. mdadm: failed to set internal bitmap. even though no such reshaping was happening according to /proc/mdstat. This seems unrelated, though. Reviewed-by: Yu Kuai <yukuai3@huawei.com> Signed-off-by: Song Liu <song@kernel.org> Link: https://lore.kernel.org/r/20240711202316.10775-1-mat.jonczyk@o2.pl	2024-07-12 01:30:38 +00:00
Heming Zhao	35a0a409fa	md-cluster: fix no recovery job when adding/re-adding a disk The commit `db5e653d7c` ("md: delay choosing sync action to md_start_sync()") delays the start of the sync action. In a clustered environment, this will cause another node to first activate the spare disk and skip recovery. As a result, no nodes will perform recovery when a disk is added or re-added. Before db5e653d7c9f: ``` node1 node2 ---------------------------------------------------------------- md_check_recovery + md_update_sb \| sendmsg: METADATA_UPDATED + md_choose_sync_action process_metadata_update \| remove_and_add_spares //node1 has not finished adding + call mddev->sync_work //the spare disk:do nothing md_start_sync starts md_do_sync md_do_sync + grabbed resync_lockres:DLM_LOCK_EX + do syncing job md_check_recovery sendmsg: METADATA_UPDATED process_metadata_update //activate spare disk ... ... md_do_sync waiting to grab resync_lockres:EX ``` After db5e653d7c9f: (note: if 'cmd:idle' sets MD_RECOVERY_INTR after md_check_recovery starts md_start_sync, setting the INTR action will exacerbate the delay in node1 calling the md_do_sync function.) ``` node1 node2 ---------------------------------------------------------------- md_check_recovery + md_update_sb \| sendmsg: METADATA_UPDATED + calls mddev->sync_work process_metadata_update //node1 has not finished adding //the spare disk:do nothing md_start_sync + md_choose_sync_action \| remove_and_add_spares + calls md_do_sync md_check_recovery md_update_sb sendmsg: METADATA_UPDATED process_metadata_update //activate spare disk ... ... ... ... md_do_sync + grabbed resync_lockres:EX + raid1_sync_request skip sync under conf->fullsync:0 md_do_sync 1. waiting to grab resync_lockres:EX 2. when node1 could grab EX lock, node1 will skip resync under recovery_offset:MaxSector ``` How to trigger: ```(commands @node1) # to easily watch the recovery status echo 2000 > /proc/sys/dev/raid/speed_limit_max ssh root@node2 "echo 2000 > /proc/sys/dev/raid/speed_limit_max" mdadm -CR /dev/md0 -l1 -b clustered -n 2 /dev/sda /dev/sdb --assume-clean ssh root@node2 mdadm -A /dev/md0 /dev/sda /dev/sdb mdadm --manage /dev/md0 --fail /dev/sda --remove /dev/sda mdadm --manage /dev/md0 --add /dev/sdc === "cat /proc/mdstat" on both node, there are no recovery action. === ``` How to fix: because md layer code logic is hard to restore for speeding up sync job on local node, we add new cluster msg to pending the another node to active disk. Signed-off-by: Heming Zhao <heming.zhao@suse.com> Reviewed-by: Su Yue <glass.su@suse.com> Acked-by: Yu Kuai <yukuai3@huawei.com> Signed-off-by: Song Liu <song@kernel.org> Link: https://lore.kernel.org/r/20240709104120.22243-2-heming.zhao@suse.com	2024-07-12 01:30:18 +00:00
Heming Zhao	fff42f2138	md-cluster: fix hanging issue while a new disk adding The commit `1bbe254e43` ("md-cluster: check for timeout while a new disk adding") is correct in terms of code syntax but not suite real clustered code logic. When a timeout occurs while adding a new disk, if recv_daemon() bypasses the unlock for ack_lockres:CR, another node will be waiting to grab EX lock. This will cause the cluster to hang indefinitely. How to fix: 1. In dlm_lock_sync(), change the wait behaviour from forever to a timeout, This could avoid the hanging issue when another node fails to handle cluster msg. Another result of this change is that if another node receives an unknown msg (e.g. a new msg_type), the old code will hang, whereas the new code will timeout and fail. This could help cluster_md handle new msg_type from different nodes with different kernel/module versions (e.g. The user only updates one leg's kernel and monitors the stability of the new kernel). 2. The old code for __sendmsg() always returns 0 (success) under the design (must successfully unlock ->message_lockres). This commit makes this function return an error number when an error occurs. Fixes: `1bbe254e43` ("md-cluster: check for timeout while a new disk adding") Signed-off-by: Heming Zhao <heming.zhao@suse.com> Reviewed-by: Su Yue <glass.su@suse.com> Acked-by: Yu Kuai <yukuai3@huawei.com> Signed-off-by: Song Liu <song@kernel.org> Link: https://lore.kernel.org/r/20240709104120.22243-1-heming.zhao@suse.com	2024-07-12 01:30:17 +00:00
Linus Torvalds	43db1e03c0	- Fix broken discard for device mapper VDO target -----BEGIN PGP SIGNATURE----- iIoEABYIADIWIQRnH8MwLyZDhyYfesYTAyx9YGnhbQUCZpAyqhQcbXBhdG9ja2FA cmVkaGF0LmNvbQAKCRATAyx9YGnhbRJyAQD8JNzNUDd2hTJhKEGzf7RZ1cOjbp6e E0N9kWJHw37HTgD/ZJ+24Z9P2U+Uip8+tzisGCcnbHaJMM4ndc+NS/u4rQw= =65U3 -----END PGP SIGNATURE----- Merge tag 'for-6.10/dm-fixes-2' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm Pull device mapper fix from Mikulas Patocka: - Fix broken discard for device mapper VDO target * tag 'for-6.10/dm-fixes-2' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm: dm vdo: replace max_discard_sectors with max_hw_discard_sectors	2024-07-11 15:11:14 -07:00
Bruce Johnston	d5cfecfe7f	dm vdo: replace max_discard_sectors with max_hw_discard_sectors Commit `4f563a6473` ("block: add a max_user_discard_sectors queue limit") changed block core to set max_discard_sectors to: min(lim->max_hw_discard_sectors, lim->max_user_discard_sectors) Commit `825d8bbd2f` ("dm: always manage discard support in terms of max_hw_discard_sectors") fixed most dm targetss to deal with this, by replacing max_discard_sectors with max_hw_discard_sectors. Unfortunately, dm-vdo did not get fixed at that time. Fixes: `825d8bbd2f` ("dm: always manage discard support in terms of max_hw_discard_sectors") Signed-off-by: Bruce Johnston <bjohnsto@redhat.com> Signed-off-by: Matthew Sakai <msakai@redhat.com> Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>	2024-07-11 21:24:41 +02:00
Heinz Mauelshagen	d176fadb9e	dm raid: fix stripes adding reshape size issues Adding stripes to an existing raid4/5/6/10 mapped device grows its capacity though it'll be only made available _after_ the respective reshape finished as of MD kernel reshape semantics. Such reshaping involves moving a window forward starting at BOD reading content from previous lesser stripes and writing them back in the new layout with more stripes. Once that process finishes at end of previous data, the grown size may be announced and used. In order to avoid writing over any existing data in place, out-of-place space is added to the beginning of each data device by lvm2 before starting the reshape process. That reshape space wasn't taken into acount for data device size calculation. Fixes resulting from above: - correct event handling conditions in do_table_event() to set the device's capacity after the stripe adding reshape ended - subtract mentioned out-of-place space doing data device and array size calculations - conditionally set capacity as of superblock in preresume Testing: - passes all LVM2 RAID tests including new lvconvert-raid-reshape-size.sh one Tested-by: Heinz Mauelshagen <heinzm@redhat.com> Signed-off-by: Heinz Mauelshagen <heinzm@redhat.com> Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>	2024-07-10 13:10:06 +02:00
Heinz Mauelshagen	453496b899	dm raid: move _get_reshape_sectors() as prerequisite to fixing reshape size issues rs_set_dev_and_array_sectors() needs this function to calculate device and array size properly in case leg data devices have out-of-place reshape space allocated. Signed-off-by: Heinz Mauelshagen <heinzm@redhat.com> Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>	2024-07-10 13:10:06 +02:00
Mikulas Patocka	6a6c56130a	dm-crypt: support for per-sector NVMe metadata Support per-sector NVMe metadata in dm-crypt. This commit changes dm-crypt, so that it can use NVMe metadata to store authentication information. We can put dm-crypt directly on the top of NVMe device, without using dm-integrity. This commit improves write throughput twice, becase the will be no writes to the dm-integrity journal. Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>	2024-07-10 13:10:06 +02:00
Benjamin Marzinski	a48f6b82c5	dm mpath: don't call dm_get_device in multipath_message When mutipath_message is called with an action and a device, it needs to find the pgpath that matches that device. dm_get_device() is not the right function for this. dm_get_device() will look for a table_device matching the requested path in use by either the live or inactive table. If it doesn't find the device, dm_get_device() will open it and add it to the table. Means that multipath_message will accept any block device, add it to the table if not present, and then look through the pgpaths to see if it finds a match. Afterwards it will remove the device if it was not previously in the table devices list. This is the only function that can modify the device list of a table besides the constructors and destructors, and it can only do this when it was passed an invalid message. Instead, multipath_message() should call dm_devt_from_path() to get the device dev_t, and match that against its pgpaths. Signed-off-by: Benjamin Marzinski <bmarzins@redhat.com> Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>	2024-07-10 13:10:06 +02:00
Benjamin Marzinski	a21f9edb13	dm: factor out helper function from dm_get_device Factor out a helper function, dm_devt_from_path(), from dm_get_device() for use in dm targets. Signed-off-by: Benjamin Marzinski <bmarzins@redhat.com> Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>	2024-07-10 13:10:06 +02:00
Eric Biggers	3708c72695	dm-verity: fix dm_is_verity_target() when dm-verity is builtin When CONFIG_DM_VERITY=y, dm_is_verity_target() returned true for any builtin dm target, not just dm-verity. Fix this by checking for verity_target instead of THIS_MODULE (which is NULL for builtin code). Fixes: `b6c1c5745c` ("dm: Add verity helpers for LoadPin") Cc: stable@vger.kernel.org Cc: Matthias Kaehlcke <mka@chromium.org> Cc: Kees Cook <keescook@chromium.org> Signed-off-by: Eric Biggers <ebiggers@google.com> Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>	2024-07-10 13:10:06 +02:00
Damien Le Moal	9d45db03ac	dm: Remove max_secure_erase_granularity The max_secure_erase_granularity boolean of struct dm_target is used in __process_abnormal_io() but never set by any target. Remove this field and the dead code using it. Signed-off-by: Damien Le Moal <dlemoal@kernel.org> Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>	2024-07-10 13:10:06 +02:00
Damien Le Moal	396a27e912	dm: Remove max_write_zeroes_granularity The max_write_zeroes_granularity boolean of struct dm_target is used in __process_abnormal_io() but never set by any target. Remove this field and the dead code using it. Signed-off-by: Damien Le Moal <dlemoal@kernel.org> Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>	2024-07-10 13:10:06 +02:00
Jiapeng Chong	7017ded001	dm vdo indexer: use swap() instead of open coding it Use existing swap() macro rather than duplicating its implementation. Reported-by: Abaci Robot <abaci@linux.alibaba.com> Closes: https://bugzilla.openanolis.cn/show_bug.cgi?id=9173 Signed-off-by: Jiapeng Chong <jiapeng.chong@linux.alibaba.com> Signed-off-by: Matthew Sakai <msakai@redhat.com> Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>	2024-07-10 13:10:06 +02:00
Dr. David Alan Gilbert	b956d1a30f	dm vdo: remove unused struct 'uds_attribute' 'uds_attribute' is unused since commit `a9da0fb6d8` ("dm vdo: remove all sysfs interfaces"). Remove it. Signed-off-by: Dr. David Alan Gilbert <linux@treblig.org> Signed-off-by: Matthew Sakai <msakai@redhat.com> Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>	2024-07-10 13:10:06 +02:00
Christoph Hellwig	0a94a469a4	dm: stop using blk_limits_io_{min,opt} Remove use of the blk_limits_io_{min,opt} and assign the values directly to the queue_limits structure. For the io_opt this is a completely mechanical change, for io_min it removes flooring the limit to the physical and logical block size in the particular caller. But as blk_validate_limits will do the same later when actually applying the limits, there still is no change in overall behavior. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>	2024-07-10 13:10:06 +02:00
Mikulas Patocka	0d815e3400	dm-crypt: limit the size of encryption requests There was a performance regression reported where dm-crypt would perform worse on new kernels than on old kernels. The reason is that the old kernels split the bios to NVMe request size (that is usually 65536 or 131072 bytes) and the new kernels pass the big bios through dm-crypt and split them underneath. If a big 1MiB bio is passed to dm-crypt, dm-crypt processes it on a single core without parallelization and this is what causes the performance degradation. This commit introduces new tunable variables /sys/module/dm_crypt/parameters/max_read_size and /sys/module/dm_crypt/parameters/max_write_size that specify the maximum bio size for dm-crypt. Bios larger than this value are split, so that they can be encrypted in parallel by multiple cores. If these variables are '0', a default 131072 is used. Splitting bios may cause performance regressions in other workloads - if this happens, the user should increase the value in max_read_size and max_write_size variables. max_read_size: 128k 2399MiB/s 256k 2368MiB/s 512k 1986MiB/s 1024 1790MiB/s max_write_size: 128k 1712MiB/s 256k 1651MiB/s 512k 1537MiB/s 1024k 1332MiB/s Note that if you run dm-crypt inside a virtual machine, you may need to do "echo numa >/sys/module/workqueue/parameters/default_affinity_scope" to improve performance. Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Tested-by: Laurence Oberman <loberman@redhat.com>	2024-07-10 13:09:50 +02:00
Damien Le Moal	81e7706345	dm: handle REQ_OP_ZONE_RESET_ALL This commit implements processing of the REQ_OP_ZONE_RESET_ALL operation for zoned mapped devices. Given that this operation always has a BIO sector of 0 and a 0 size, processing through the regular BIO __split_and_process_bio() function does not work because this function would always select the first target. Instead, handling of this operation is implemented using the function __send_zone_reset_all(). Similarly to the __send_empty_flush() function, the new __send_zone_reset_all() function manually goes through all targets of a mapped device table doing the following: 1) If the target can natively support REQ_OP_ZONE_RESET_ALL, __send_duplicate_bios() is used to forward the reset all operation to the target. This case is handled with the __send_zone_reset_all_native() function. 2) For other targets, the function __send_zone_reset_all_emulated() is executed to emulate the execution of REQ_OP_ZONE_RESET_ALL using regular REQ_OP_ZONE_RESET operations. Targets that can natively support REQ_OP_ZONE_RESET_ALL are identified using the new target field zone_reset_all_supported. This boolean is set to true in for targets that have reliable zone limits, that is, targets that map all sequential write required zones of their zoned device(s). Setting this field is handled in dm_set_zones_restrictions() and device_get_zone_resource_limits(). For targets with unreliable zone limits, REQ_OP_ZONE_RESET_ALL must be emulated (case 2 above). This is implemented with __send_zone_reset_all_emulated() and is similar to the block layer function blkdev_zone_reset_all_emulated(): first a report zones is done for the zones of the target to identify zones that need reset, that is, any sequential write required zone that is not already empty. This is done using a bitmap and the function dm_zone_get_reset_bitmap() which sets to 1 the bit corresponding to a zone that needs reset. Next, this zone bitmap is inspected and a clone BIO modified to use the REQ_OP_ZONE_RESET operation issued for any zone with its bit set in the zone bitmap. This implementation is more efficient than what the block layer does with blkdev_zone_reset_all_emulated(), which is always used for DM zoned devices currently: as we can natively use REQ_OP_ZONE_RESET_ALL on targets mapping all sequential write required zones, resetting all zones of a zoned mapped device can be much faster compared to always emulating this operation using regular per-zone reset. In the worst case, this implementation is as-efficient as the block layer emulation. This reduction in the time it takes to reset all zones of a zoned mapped device depends directly on the mapped device targets mapping (reliable zone limits or not). Signed-off-by: Damien Le Moal <dlemoal@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Link: https://lore.kernel.org/r/20240704052816.623865-4-dlemoal@kernel.org Signed-off-by: Jens Axboe <axboe@kernel.dk>	2024-07-05 00:42:04 -06:00
Damien Le Moal	ae7e965b36	dm: Refactor is_abnormal_io() Use a single switch-case to simplify is_abnormal_io() and make this function more readable and easier to modify. Signed-off-by: Damien Le Moal <dlemoal@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Link: https://lore.kernel.org/r/20240704052816.623865-3-dlemoal@kernel.org Signed-off-by: Jens Axboe <axboe@kernel.dk>	2024-07-05 00:42:04 -06:00
Benjamin Marzinski	25b3a8237a	md/raid5: recheck if reshape has finished with device_lock held When handling an IO request, MD checks if a reshape is currently happening, and if so, where the IO sector is in relation to the reshape progress. MD uses conf->reshape_progress for both of these tasks. When the reshape finishes, conf->reshape_progress is set to MaxSector. If this occurs after MD checks if the reshape is currently happening but before it calls ahead_of_reshape(), then ahead_of_reshape() will end up comparing the IO sector against MaxSector. During a backwards reshape, this will make MD think the IO sector is in the area not yet reshaped, causing it to use the previous configuration, and map the IO to the sector where that data was before the reshape. This bug can be triggered by running the lvm2 lvconvert-raid-reshape-linear_to_raid6-single-type.sh test in a loop, although it's very hard to reproduce. Fix this by factoring the code that checks where the IO sector is in relation to the reshape out to a helper called get_reshape_loc(), which reads reshape_progress and reshape_safe while holding the device_lock, and then rechecks if the reshape has finished before calling ahead_of_reshape with the saved values. Also use the helper during the REQ_NOWAIT check to see if the location is inside of the reshape region. Fixes: `fef9c61fdf` ("md/raid5: change reshape-progress measurement to cope with reshaping backwards.") Signed-off-by: Benjamin Marzinski <bmarzins@redhat.com> Signed-off-by: Song Liu <song@kernel.org> Link: https://lore.kernel.org/r/20240702151802.1632010-1-bmarzins@redhat.com	2024-07-04 06:35:19 +00:00
Yu Kuai	a1fd37f978	md: Don't wait for MD_RECOVERY_NEEDED for HOT_REMOVE_DISK ioctl Commit `90f5f7ad4f` ("md: Wait for md_check_recovery before attempting device removal.") explained in the commit message that failed device must be reomoved from the personality first by md_check_recovery(), before it can be removed from the array. That's the reason the commit add the code to wait for MD_RECOVERY_NEEDED. However, this is not the case now, because remove_and_add_spares() is called directly from hot_remove_disk() from ioctl path, hence failed device(marked faulty) can be removed from the personality by ioctl. On the other hand, the commit introduced a performance problem that if MD_RECOVERY_NEEDED is set and the array is not running, ioctl will wait for 5s before it can return failure to user. Since the waiting is not needed now, fix the problem by removing the waiting. Fixes: `90f5f7ad4f` ("md: Wait for md_check_recovery before attempting device removal.") Reported-by: Mateusz Kusiak <mateusz.kusiak@linux.intel.com> Closes: https://lore.kernel.org/all/814ff6ee-47a2-4ba0-963e-cf256ee4ecfa@linux.intel.com/ Signed-off-by: Yu Kuai <yukuai3@huawei.com> Signed-off-by: Song Liu <song@kernel.org> Link: https://lore.kernel.org/r/20240627112321.3044744-1-yukuai1@huaweicloud.com	2024-07-04 06:32:03 +00:00
Christophe JAILLET	1f4a72ff00	md-cluster: Constify struct md_cluster_operations 'struct md_cluster_operations' is not modified in this driver. Constifying this structure moves some data to a read-only section, so increase overall security. On a x86_64, with allmodconfig, as an example: Before: ====== text data bss dec hex filename 51941 1442 80 53463 d0d7 drivers/md/md-cluster.o After: ===== text data bss dec hex filename 52133 1246 80 53459 d0d3 drivers/md/md-cluster.o Signed-off-by: Christophe JAILLET <christophe.jaillet@wanadoo.fr> Signed-off-by: Song Liu <song@kernel.org> Link: https://lore.kernel.org/r/3727f3ce9693cae4e62ae6778ea13971df805479.1719173852.git.christophe.jaillet@wanadoo.fr	2024-07-04 06:20:27 +00:00
Yang Li	ae720670b9	md: Remove unneeded semicolon ./drivers/md/md.c:630:21-22: Unneeded semicolon Reported-by: Abaci Robot <abaci@linux.alibaba.com> Closes: https://bugzilla.openanolis.cn/show_bug.cgi?id=9344 Signed-off-by: Yang Li <yang.lee@linux.alibaba.com> Signed-off-by: Song Liu <song@kernel.org> Link: https://lore.kernel.org/r/20240618010759.85416-1-yang.lee@linux.alibaba.com	2024-07-04 06:14:03 +00:00
Yu Kuai	2314c2e3a7	md/raid5: fix spares errors about rcu usage As commit `ad8606702f` ("md/raid5: remove rcu protection to access rdev from conf") explains, rcu protection can be removed, however, there are three places left, there won't be any real problems. drivers/md/raid5.c:8071:24: error: incompatible types in comparison expression (different address spaces): drivers/md/raid5.c:8071:24: struct md_rdev [noderef] __rcu * drivers/md/raid5.c:8071:24: struct md_rdev * drivers/md/raid5.c:7569:25: error: incompatible types in comparison expression (different address spaces): drivers/md/raid5.c:7569:25: struct md_rdev [noderef] __rcu * drivers/md/raid5.c:7569:25: struct md_rdev * drivers/md/raid5.c:7573:25: error: incompatible types in comparison expression (different address spaces): drivers/md/raid5.c:7573:25: struct md_rdev [noderef] __rcu * drivers/md/raid5.c:7573:25: struct md_rdev * Fixes: `ad8606702f` ("md/raid5: remove rcu protection to access rdev from conf") Cc: stable@vger.kernel.org Signed-off-by: Yu Kuai <yukuai3@huawei.com> Signed-off-by: Song Liu <song@kernel.org> Link: https://lore.kernel.org/r/20240615085143.1648223-1-yukuai1@huaweicloud.com	2024-07-04 06:11:53 +00:00
Luca Boccassi	6fce1f40e9	dm verity: add support for signature verification with platform keyring Add a new configuration CONFIG_DM_VERITY_VERIFY_ROOTHASH_SIG_PLATFORM_KEYRING that enables verifying dm-verity signatures using the platform keyring, which is populated using the UEFI DB certificates. This is useful for self-enrolled systems that do not use MOK, as the secondary keyring which is already used for verification, if the relevant kconfig is enabled, is linked to the machine keyring, which gets its certificates loaded from MOK. On datacenter/virtual/cloud deployments it is more common to deploy one's own certificate chain directly in DB on first boot in unattended mode, rather than relying on MOK, as the latter typically requires interactive authentication to enroll, and is more suited for personal machines. Default to the same value as DM_VERITY_VERIFY_ROOTHASH_SIG_SECONDARY_KEYRING if not otherwise specified, as it is likely that if one wants to use MOK certificates to verify dm-verity volumes, DB certificates are going to be used too. Keys in DB are allowed to load a full kernel already anyway, so they are already highly privileged. Signed-off-by: Luca Boccassi <bluca@debian.org> Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>	2024-07-03 21:41:11 +02:00
Benjamin Marzinski	3199a34bfa	dm-raid: Fix WARN_ON_ONCE check for sync_thread in raid_resume rm-raid devices will occasionally trigger the following warning when being resumed after a table load because DM_RECOVERY_RUNNING is set: WARNING: CPU: 7 PID: 5660 at drivers/md/dm-raid.c:4105 raid_resume+0xee/0x100 [dm_raid] The failing check is: WARN_ON_ONCE(test_bit(MD_RECOVERY_RUNNING, &mddev->recovery)); This check is designed to make sure that the sync thread isn't registered, but md_check_recovery can set MD_RECOVERY_RUNNING without the sync_thread ever getting registered. Instead of checking if MD_RECOVERY_RUNNING is set, check if sync_thread is non-NULL. Fixes: `16c4770c75` ("dm-raid: really frozen sync_thread during suspend") Suggested-by: Yu Kuai <yukuai3@huawei.com> Signed-off-by: Benjamin Marzinski <bmarzins@redhat.com> Reviewed-by: Yu Kuai <yukuai3@huawei.com> Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>	2024-07-03 21:41:11 +02:00
Eric Biggers	b76ad88442	dm-verity: hash blocks with shash import+finup when possible Currently dm-verity computes the hash of each block by using multiple calls to the "ahash" crypto API. While the exact sequence depends on the chosen dm-verity settings, in the vast majority of cases it is: 1. crypto_ahash_init() 2. crypto_ahash_update() [salt] 3. crypto_ahash_update() [data] 4. crypto_ahash_final() This is inefficient for two main reasons: - It makes multiple indirect calls, which is expensive on modern CPUs especially when mitigations for CPU vulnerabilities are enabled. Since the salt is the same across all blocks on a given dm-verity device, a much more efficient sequence would be to do an import of the pre-salted state, then a finup. - It uses the ahash (asynchronous hash) API, despite the fact that CPU-based hashing is almost always used in practice, and therefore it experiences the overhead of the ahash-based wrapper for shash. Because dm-verity was intentionally converted to ahash to support off-CPU crypto accelerators, a full reversion to shash might not be acceptable. Yet, we should still provide a fast path for shash with the most common dm-verity settings. Another reason for shash over ahash is that the upcoming multibuffer hashing support, which is specific to CPU-based hashing, is much better suited for shash than for ahash. Supporting it via ahash would add significant complexity and overhead. And it's not possible for the "same" code to properly support both multibuffer hashing and HW accelerators at the same time anyway, given the different computation models. Unfortunately there will always be code specific to each model needed (for users who want to support both). Therefore, this patch adds a new shash import+finup based fast path to dm-verity. It is used automatically when appropriate. This makes dm-verity optimized for what the vast majority of users want: CPU-based hashing with the most common settings, while still retaining support for rarer settings and off-CPU crypto accelerators. In benchmarks with veritysetup's default parameters (SHA-256, 4K data and hash block sizes, 32-byte salt), which also match the parameters that Android currently uses, this patch improves block hashing performance by about 15% on x86_64 using the SHA-NI instructions, or by about 5% on arm64 using the ARMv8 SHA2 instructions. On x86_64 roughly two-thirds of the improvement comes from the use of import and finup, while the remaining third comes from the switch from ahash to shash. Note that another benefit of using "import" to handle the salt is that if the salt size is equal to the input size of the hash algorithm's compression function, e.g. 64 bytes for SHA-256, then the performance is exactly the same as no salt. This doesn't seem to be much better than veritysetup's current default of 32-byte salts, due to the way SHA-256's finalization padding works, but it should be marginally better. Reviewed-by: Sami Tolvanen <samitolvanen@google.com> Acked-by: Ard Biesheuvel <ardb@kernel.org> Signed-off-by: Eric Biggers <ebiggers@google.com> Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>	2024-07-03 21:41:11 +02:00
Eric Biggers	e8f5e93301	dm-verity: make verity_hash() take dm_verity_io instead of ahash_request In preparation for adding shash support to dm-verity, change verity_hash() to take a pointer to a struct dm_verity_io instead of a pointer to the ahash_request embedded inside it. Reviewed-by: Sami Tolvanen <samitolvanen@google.com> Acked-by: Ard Biesheuvel <ardb@kernel.org> Signed-off-by: Eric Biggers <ebiggers@google.com> Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>	2024-07-03 21:41:11 +02:00
Eric Biggers	cf715f4b7e	dm-verity: always "map" the data blocks dm-verity needs to access data blocks by virtual address in three different cases (zeroization, recheck, and forward error correction), and one more case (shash support) is coming. Since it's guaranteed that dm-verity data blocks never cross pages, and kmap_local_page and kunmap_local are no-ops on modern platforms anyway, just unconditionally "map" every data block's page and work with the virtual buffer directly. This simplifies the code and eliminates unnecessary overhead. Reviewed-by: Sami Tolvanen <samitolvanen@google.com> Acked-by: Ard Biesheuvel <ardb@kernel.org> Signed-off-by: Eric Biggers <ebiggers@google.com> Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>	2024-07-03 21:41:11 +02:00
Eric Biggers	09d1430896	dm-verity: provide dma_alignment limit in io_hints Since Linux v6.1, some filesystems support submitting direct I/O that is aligned to only dma_alignment instead of the logical_block_size alignment that was required before. I/O that is not aligned to the logical_block_size is difficult to handle in device-mapper targets that do cryptographic processing of data, as it makes the units of data that are hashed or encrypted possibly be split across pages, creating rarely used and rarely tested edge cases. As such, dm-crypt and dm-integrity have already opted out of this by setting dma_alignment to 'logical_block_size - 1'. Although dm-verity does have code that handles these cases (or at least is intended to do so), supporting direct I/O with such a low amount of alignment is not really useful on dm-verity devices. So, opt dm-verity out of it too so that it's not necessary to handle these edge cases. Reviewed-by: Sami Tolvanen <samitolvanen@google.com> Acked-by: Ard Biesheuvel <ardb@kernel.org> Signed-off-by: Eric Biggers <ebiggers@google.com> Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>	2024-07-03 21:41:08 +02:00
Eric Biggers	a7ddb3d49d	dm-verity: make real_digest and want_digest fixed-length Change the digest fields in struct dm_verity_io from variable-length to fixed-length, since their maximum length is fixed at HASH_MAX_DIGESTSIZE, i.e. 64 bytes, which is not too big. This is simpler and makes the fields a bit faster to access. (HASH_MAX_DIGESTSIZE did not exist when this code was written, which may explain why it wasn't used.) This makes the verity_io_real_digest() and verity_io_want_digest() functions trivial, but this patch leaves them in place temporarily since most of their callers will go away in a later patch anyway. Reviewed-by: Sami Tolvanen <samitolvanen@google.com> Acked-by: Ard Biesheuvel <ardb@kernel.org> Signed-off-by: Eric Biggers <ebiggers@google.com> Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>	2024-07-03 21:33:17 +02:00
Eric Biggers	e41e52e59e	dm-verity: move data hash mismatch handling into its own function Move the code that handles mismatches of data block hashes into its own function so that it doesn't clutter up verity_verify_io(). Reviewed-by: Sami Tolvanen <samitolvanen@google.com> Acked-by: Ard Biesheuvel <ardb@kernel.org> Signed-off-by: Eric Biggers <ebiggers@google.com> Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>	2024-07-03 21:32:15 +02:00
Christoph Hellwig	da042a3655	block: split integrity support out of bio.h Split struct bio_integrity_payload and the related prototypes out of bio.h into a separate bio-integrity.h header so that it is only pulled in by the few places that need it. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Anuj Gupta <anuj20.g@samsung.com> Reviewed-by: Kanchan Joshi <joshi.k@samsung.com> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Link: https://lore.kernel.org/r/20240702151047.1746127-2-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>	2024-07-03 10:21:15 -06:00
Eric Biggers	44d36a2ea4	dm-verity: move hash algorithm setup into its own function Move the code that sets up the hash transformation into its own function. No change in behavior. Reviewed-by: Sami Tolvanen <samitolvanen@google.com> Acked-by: Ard Biesheuvel <ardb@kernel.org> Signed-off-by: Eric Biggers <ebiggers@google.com> Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>	2024-07-02 20:53:52 +02:00
Benjamin Marzinski	140ce37fd7	dm init: Handle minors larger than 255 dm_parse_device_entry() simply copies the minor number into dmi.dev, but the dev_t format splits the minor number between the lowest 8 bytes and highest 12 bytes. If the minor number is larger than 255, part of it will end up getting treated as the major number Fix this by checking that the minor number is valid and then encoding it as a dev_t. Signed-off-by: Benjamin Marzinski <bmarzins@redhat.com> Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>	2024-07-02 20:53:41 +02:00
Dr. David Alan Gilbert	c1a66a37d6	dm cache metadata: remove unused struct 'thunk' 'thunk' has been unused since commit `f177940a80` ("dm cache metadata: switch to using the new cursor api for loading metadata"). Remove it. Signed-off-by: Dr. David Alan Gilbert <linux@treblig.org> Reviewed-by: Matthew Sakai <msakai@redhat.com> Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>	2024-07-02 20:53:05 +02:00
Benjamin Marzinski	babe69e86d	dm io: remove code duplication between sync_io and aysnc_io The only difference between the code to setup and dispatch the io in sync_io() and async_io() is the sync argument to dispatch_io(), which is used to update the opf argument. Update the opf argument direcly in sync_io(), and remove the sync argument from dispatch_io(). Then, make sync_io() call async_io() instead of duplicting all of its code. Signed-off-by: Benjamin Marzinski <bmarzins@redhat.com> Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>	2024-07-02 12:00:43 +02:00
Benjamin Marzinski	b0042ba768	dm io: don't call the async_io notify.fn on invalid num_regions If dm_io() returned an error, callers that set a notify.fn and wanted it called on an error need to check the return value and call notify.fn themselves if it was -EINVAL but not if it was -EIO. None of them do this (granted, all the existing async_io users of dm_io call it in a way that is guaranteed to not return an error). Simplify the interface by never calling the notify.fn if dm_io returns an error. This works with the existing dm_io callers which check for an error and handle it using the same methods as the notify.fn. This also allows us to move the now equivalent num_regions checks out of sync_io() and async_io() and into dm_io() itself. Additionally, change async_io() into a void function, since it can no longer fail. Signed-off-by: Benjamin Marzinski <bmarzins@redhat.com> Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>	2024-07-02 11:58:56 +02:00
Benjamin Marzinski	06a0b333e5	dm io: bump num_bvecs to handle offset memory If dp->get_page() returns a non-zero offset, the bio might need an additional bvec to deal with the offset. For example, if remaining is exactly one page size, but there is an offset, the memory will span two pages. Signed-off-by: Benjamin Marzinski <bmarzins@redhat.com> Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>	2024-07-02 11:58:08 +02:00
Christoph Hellwig	f1e46758e8	bcache: work around a __bitwise to bool conversion sparse warning Sparse is a bit dumb about bitwise operation on __bitwise types used in boolean contexts. Add a !! to explicitly propagate to boolean without a warning. Fixes: `fcf865e357` ("block: convert features and flags to __bitwise types") Reported-by: kernel test robot <lkp@intel.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Acked-by: Kent Overstreet <kent.overstreet@linux.dev> Link: https://lore.kernel.org/r/20240628131657.667797-1-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>	2024-06-28 10:25:00 -06:00
Christoph Hellwig	573d5abf3d	md: set md-specific flags for all queue limits The md driver wants to enforce a number of flags for all devices, even when not inheriting them from the underlying devices. To make sure these flags survive the queue_limits_set calls that md uses to update the queue limits without deriving them form the previous limits add a new md_init_stacking_limits helper that calls blk_set_stacking_limits and sets these flags. Fixes: `1122c0c1cc` ("block: move cache control settings out of queue->flags") Reported-by: kernel test robot <oliver.sang@intel.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Damien Le Moal <dlemoal@kernel.org> Link: https://lore.kernel.org/r/20240626142637.300624-2-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>	2024-06-26 09:37:35 -06:00
Mikulas Patocka	aaa53168cb	dm: optimize flushes Device mapper sends flush bios to all the targets and the targets send it to the underlying device. That may be inefficient, for example if a table contains 10 linear targets pointing to the same physical device, then device mapper would send 10 flush bios to that device - despite the fact that only one bio would be sufficient. This commit optimizes the flush behavior. It introduces a per-target variable flush_bypasses_map - it is set when the target supports flush optimization - currently, the dm-linear and dm-stripe targets support it. When all the targets in a table have flush_bypasses_map, flush_bypasses_map on the table is set. __send_empty_flush tests if the table has flush_bypasses_map - and if it has, no flush bios are sent to the targets via the "map" method and the list dm_table->devices is iterated and the flush bios are sent to each member of the list. Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Reviewed-by: Mike Snitzer <snitzer@kernel.org> Suggested-by: Yang Yang <yang.yang@vivo.com>	2024-06-26 11:32:39 -04:00
Kuan-Wei Chiu	866898efbb	bcache: remove heap-related macros and switch to generic min_heap Drop the heap-related macros from bcache and replacing them with the generic min_heap implementation from include/linux. By doing so, code readability is improved by using functions instead of macros. Moreover, the min_heap implementation in include/linux adopts a bottom-up variation compared to the textbook version currently used in bcache. This bottom-up variation allows for approximately 50% reduction in the number of comparison operations during heap siftdown, without changing the number of swaps, thus making it more efficient. Link: https://lkml.kernel.org/ioyfizrzq7w7mjrqcadtzsfgpuntowtjdw5pgn4qhvsdp4mqqg@nrlek5vmisbu Link: https://lkml.kernel.org/r/20240524152958.919343-16-visitorckw@gmail.com Signed-off-by: Kuan-Wei Chiu <visitorckw@gmail.com> Reviewed-by: Ian Rogers <irogers@google.com> Acked-by: Coly Li <colyli@suse.de> Cc: Adrian Hunter <adrian.hunter@intel.com> Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com> Cc: Arnaldo Carvalho de Melo <acme@kernel.org> Cc: Bagas Sanjaya <bagasdotme@gmail.com> Cc: Brian Foster <bfoster@redhat.com> Cc: Ching-Chun (Jim) Huang <jserv@ccns.ncku.edu.tw> Cc: Ingo Molnar <mingo@redhat.com> Cc: Jiri Olsa <jolsa@kernel.org> Cc: Kent Overstreet <kent.overstreet@linux.dev> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Matthew Sakai <msakai@redhat.com> Cc: Namhyung Kim <namhyung@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Randy Dunlap <rdunlap@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2024-06-24 22:25:00 -07:00
Kuan-Wei Chiu	bfe3127180	lib min_heap: rename min_heapify() to min_heap_sift_down() After adding min_heap_sift_up(), the naming convention has been adjusted to maintain consistency with the min_heap_sift_up(). Consequently, min_heapify() has been renamed to min_heap_sift_down(). Link: https://lkml.kernel.org/CAP-5=fVcBAxt8Mw72=NCJPRJfjDaJcqk4rjbadgouAEAHz_q1A@mail.gmail.com Link: https://lkml.kernel.org/r/20240524152958.919343-13-visitorckw@gmail.com Signed-off-by: Kuan-Wei Chiu <visitorckw@gmail.com> Reviewed-by: Ian Rogers <irogers@google.com> Cc: Adrian Hunter <adrian.hunter@intel.com> Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com> Cc: Arnaldo Carvalho de Melo <acme@kernel.org> Cc: Bagas Sanjaya <bagasdotme@gmail.com> Cc: Brian Foster <bfoster@redhat.com> Cc: Ching-Chun (Jim) Huang <jserv@ccns.ncku.edu.tw> Cc: Coly Li <colyli@suse.de> Cc: Ingo Molnar <mingo@redhat.com> Cc: Jiri Olsa <jolsa@kernel.org> Cc: Kent Overstreet <kent.overstreet@linux.dev> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Matthew Sakai <msakai@redhat.com> Cc: Namhyung Kim <namhyung@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Randy Dunlap <rdunlap@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2024-06-24 22:24:59 -07:00
Kuan-Wei Chiu	267607e875	lib min_heap: add args for min_heap_callbacks Add a third parameter 'args' for the 'less' and 'swp' functions in the 'struct min_heap_callbacks'. This additional parameter allows these comparison and swap functions to handle extra arguments when necessary. Link: https://lkml.kernel.org/r/20240524152958.919343-9-visitorckw@gmail.com Signed-off-by: Kuan-Wei Chiu <visitorckw@gmail.com> Reviewed-by: Ian Rogers <irogers@google.com> Cc: Adrian Hunter <adrian.hunter@intel.com> Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com> Cc: Arnaldo Carvalho de Melo <acme@kernel.org> Cc: Bagas Sanjaya <bagasdotme@gmail.com> Cc: Brian Foster <bfoster@redhat.com> Cc: Ching-Chun (Jim) Huang <jserv@ccns.ncku.edu.tw> Cc: Coly Li <colyli@suse.de> Cc: Ingo Molnar <mingo@redhat.com> Cc: Jiri Olsa <jolsa@kernel.org> Cc: Kent Overstreet <kent.overstreet@linux.dev> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Matthew Sakai <msakai@redhat.com> Cc: Namhyung Kim <namhyung@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Randy Dunlap <rdunlap@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2024-06-24 22:24:58 -07:00
Kuan-Wei Chiu	873ce25766	lib min_heap: add type safe interface Implement a type-safe interface for min_heap using strong type pointers instead of void * in the data field. This change includes adding small macro wrappers around functions, enabling the use of __minheap_cast and __minheap_obj_size macros for type casting and obtaining element size. This implementation removes the necessity of passing element size in min_heap_callbacks. Additionally, introduce the MIN_HEAP_PREALLOCATED macro for preallocating some elements. Link: https://lkml.kernel.org/ioyfizrzq7w7mjrqcadtzsfgpuntowtjdw5pgn4qhvsdp4mqqg@nrlek5vmisbu Link: https://lkml.kernel.org/r/20240524152958.919343-5-visitorckw@gmail.com Signed-off-by: Kuan-Wei Chiu <visitorckw@gmail.com> Reviewed-by: Ian Rogers <irogers@google.com> Cc: Adrian Hunter <adrian.hunter@intel.com> Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com> Cc: Arnaldo Carvalho de Melo <acme@kernel.org> Cc: Bagas Sanjaya <bagasdotme@gmail.com> Cc: Brian Foster <bfoster@redhat.com> Cc: Ching-Chun (Jim) Huang <jserv@ccns.ncku.edu.tw> Cc: Coly Li <colyli@suse.de> Cc: Ingo Molnar <mingo@redhat.com> Cc: Jiri Olsa <jolsa@kernel.org> Cc: Kent Overstreet <kent.overstreet@linux.dev> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Matthew Sakai <msakai@redhat.com> Cc: Namhyung Kim <namhyung@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Randy Dunlap <rdunlap@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2024-06-24 22:24:57 -07:00
Kuan-Wei Chiu	b42995607e	bcache: fix typo Replace 'utiility' with 'utility'. Link: https://lkml.kernel.org/r/20240524152958.919343-3-visitorckw@gmail.com Signed-off-by: Kuan-Wei Chiu <visitorckw@gmail.com> Reviewed-by: Ian Rogers <irogers@google.com> Reviewed-by: Randy Dunlap <rdunlap@infradead.org> Cc: Adrian Hunter <adrian.hunter@intel.com> Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com> Cc: Arnaldo Carvalho de Melo <acme@kernel.org> Cc: Bagas Sanjaya <bagasdotme@gmail.com> Cc: Brian Foster <bfoster@redhat.com> Cc: Ching-Chun (Jim) Huang <jserv@ccns.ncku.edu.tw> Cc: Coly Li <colyli@suse.de> Cc: Ingo Molnar <mingo@redhat.com> Cc: Jiri Olsa <jolsa@kernel.org> Cc: Kent Overstreet <kent.overstreet@linux.dev> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Matthew Sakai <msakai@redhat.com> Cc: Namhyung Kim <namhyung@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2024-06-24 22:24:56 -07:00
John Garry	f70167a7a6	block: Generalize chunk_sectors support as boundary support The purpose of the chunk_sectors limit is to ensure that a mergeble request fits within the boundary of the chunck_sector value. Such a feature will be useful for other request_queue boundary limits, so generalize the chunk_sectors merge code. This idea was proposed by Hannes Reinecke. Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Signed-off-by: John Garry <john.g.garry@oracle.com> Reviewed-by: Hannes Reinecke <hare@suse.de> Acked-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Link: https://lore.kernel.org/r/20240620125359.2684798-3-john.g.garry@oracle.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2024-06-20 15:19:17 -06:00
Jens Axboe	e821bcecdf	Merge branch 'for-6.11/block-limits' into for-6.11/block Merge in queue limits cleanups. * for-6.11/block-limits: block: move the raid_partial_stripes_expensive flag into the features field block: remove the discard_alignment flag block: move the misaligned flag into the features field block: renumber and rename the cache disabled flag block: fix spelling and grammar for in writeback_cache_control.rst block: remove the unused blk_bounce enum	2024-06-20 06:55:20 -06:00
Christoph Hellwig	7d4dec525f	block: move the raid_partial_stripes_expensive flag into the features field Move the raid_partial_stripes_expensive flags into the features field to reclaim a little bit of space. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Link: https://lore.kernel.org/r/20240619154623.450048-7-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>	2024-06-20 06:53:15 -06:00
Christoph Hellwig	4cac3d3a71	block: remove the discard_alignment flag queue_limits.discard_alignment is never read except in the places where it is stacked into another limit. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Link: https://lore.kernel.org/r/20240619154623.450048-6-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>	2024-06-20 06:53:14 -06:00
Jens Axboe	69c34f07e4	Merge branch 'for-6.11/block-limits' into for-6.11/block Merge in last round of queue limits changes from Christoph. * for-6.11/block-limits: (26 commits) block: move the bounce flag into the features field block: move the skip_tagset_quiesce flag to queue_limits block: move the pci_p2pdma flag to queue_limits block: move the zone_resetall flag to queue_limits block: move the zoned flag into the features field block: move the poll flag to queue_limits block: move the dax flag to queue_limits block: move the nowait flag to queue_limits block: move the synchronous flag to queue_limits block: move the stable_writes flag to queue_limits block: move the io_stat flag setting to queue_limits block: move the add_random flag to queue_limits block: move the nonrot flag to queue_limits block: move cache control settings out of queue->flags block: remove blk_flush_policy block: freeze the queue in queue_attr_store nbd: move setting the cache control flags to __nbd_set_size virtio_blk: remove virtblk_update_cache_mode loop: fold loop_update_rotational into loop_reconfigure_limits loop: also use the default block size from an underlying block device ... Signed-off-by: Jens Axboe <axboe@kernel.dk>	2024-06-19 08:14:49 -06:00
Christoph Hellwig	b1fc937a55	block: move the zoned flag into the features field Move the zoned flags into the features field to reclaim a little bit of space. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Damien Le Moal <dlemoal@kernel.org> Reviewed-by: Hannes Reinecke <hare@suse.de> Link: https://lore.kernel.org/r/20240617060532.127975-23-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>	2024-06-19 07:58:28 -06:00
Christoph Hellwig	8023e144f9	block: move the poll flag to queue_limits Move the poll flag into the queue_limits feature field so that it can be set atomically with the queue frozen. Stacking drivers are simplified in that they now can simply set the flag, and blk_stack_limits will clear it when the features is not supported by any of the underlying devices. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Damien Le Moal <dlemoal@kernel.org> Reviewed-by: Hannes Reinecke <hare@suse.de> Link: https://lore.kernel.org/r/20240617060532.127975-22-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>	2024-06-19 07:58:28 -06:00
Christoph Hellwig	f467fee48d	block: move the dax flag to queue_limits Move the dax flag into the queue_limits feature field so that it can be set atomically with the queue frozen. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Damien Le Moal <dlemoal@kernel.org> Reviewed-by: Hannes Reinecke <hare@suse.de> Link: https://lore.kernel.org/r/20240617060532.127975-21-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>	2024-06-19 07:58:28 -06:00
Christoph Hellwig	f76af42f8b	block: move the nowait flag to queue_limits Move the nowait flag into the queue_limits feature field so that it can be set atomically with the queue frozen. Stacking drivers are simplified in that they now can simply set the flag, and blk_stack_limits will clear it when the features is not supported by any of the underlying devices. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Damien Le Moal <dlemoal@kernel.org> Reviewed-by: Hannes Reinecke <hare@suse.de> Link: https://lore.kernel.org/r/20240617060532.127975-20-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>	2024-06-19 07:58:28 -06:00
Christoph Hellwig	1a02f3a73f	block: move the stable_writes flag to queue_limits Move the stable_writes flag into the queue_limits feature field so that it can be set atomically with the queue frozen. The flag is now inherited by blk_stack_limits, which greatly simplifies the code in dm, and fixed md which previously did not pass on the flag set on lower devices. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Damien Le Moal <dlemoal@kernel.org> Reviewed-by: Hannes Reinecke <hare@suse.de> Link: https://lore.kernel.org/r/20240617060532.127975-18-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>	2024-06-19 07:58:28 -06:00
Christoph Hellwig	cdb2497918	block: move the io_stat flag setting to queue_limits Move the io_stat flag into the queue_limits feature field so that it can be set atomically with the queue frozen. Simplify md and dm to set the flag unconditionally instead of avoiding setting a simple flag for cases where it already is set by other means, which is a bit pointless. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Damien Le Moal <dlemoal@kernel.org> Reviewed-by: Hannes Reinecke <hare@suse.de> Link: https://lore.kernel.org/r/20240617060532.127975-17-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>	2024-06-19 07:58:28 -06:00
Christoph Hellwig	39a9f1c334	block: move the add_random flag to queue_limits Move the add_random flag into the queue_limits feature field so that it can be set atomically with the queue frozen. Note that this also removes code from dm to clear the flag based on the underlying devices, which can't be reached as dm devices will always start out without the flag set. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Damien Le Moal <dlemoal@kernel.org> Reviewed-by: Hannes Reinecke <hare@suse.de> Link: https://lore.kernel.org/r/20240617060532.127975-16-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>	2024-06-19 07:58:28 -06:00
Christoph Hellwig	bd4a633b6f	block: move the nonrot flag to queue_limits Move the nonrot flag into the queue_limits feature field so that it can be set atomically with the queue frozen. Use the chance to switch to defaulting to non-rotational and require the driver to opt into rotational, which matches the polarity of the sysfs interface. For the z2ram, ps3vram, 2x memstick, ubiblock and dcssblk the new rotational flag is not set as they clearly are not rotational despite this being a behavior change. There are some other drivers that unconditionally set the rotational flag to keep the existing behavior as they arguably can be used on rotational devices even if that is probably not their main use today (e.g. virtio_blk and drbd). The flag is automatically inherited in blk_stack_limits matching the existing behavior in dm and md. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Damien Le Moal <dlemoal@kernel.org> Reviewed-by: Hannes Reinecke <hare@suse.de> Link: https://lore.kernel.org/r/20240617060532.127975-15-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>	2024-06-19 07:58:28 -06:00
Christoph Hellwig	1122c0c1cc	block: move cache control settings out of queue->flags Move the cache control settings into the queue_limits so that the flags can be set atomically with the device queue frozen. Add new features and flags field for the driver set flags, and internal (usually sysfs-controlled) flags in the block layer. Note that we'll eventually remove enough field from queue_limits to bring it back to the previous size. The disable flag is inverted compared to the previous meaning, which means it now survives a rescan, similar to the max_sectors and max_discard_sectors user limits. The FLUSH and FUA flags are now inherited by blk_stack_limits, which simplified the code in dm a lot, but also causes a slight behavior change in that dm-switch and dm-unstripe now advertise a write cache despite setting num_flush_bios to 0. The I/O path will handle this gracefully, but as far as I can tell the lack of num_flush_bios and thus flush support is a pre-existing data integrity bug in those targets that really needs fixing, after which a non-zero num_flush_bios should be required in dm for targets that map to underlying devices. Signed-off-by: Christoph Hellwig <hch@lst.de> Acked-by: Ulf Hansson <ulf.hansson@linaro.org> Reviewed-by: Damien Le Moal <dlemoal@kernel.org> Reviewed-by: Hannes Reinecke <hare@suse.de> Link: https://lore.kernel.org/r/20240617060532.127975-14-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>	2024-06-19 07:58:28 -06:00
Damien Le Moal	eaa3706fed	dm: Remove unused macro DM_ZONE_INVALID_WP_OFST With the switch to using the zone append emulation of the block layer zone write plugging, the macro DM_ZONE_INVALID_WP_OFST is no longer used in dm-zone.c. Remove its definition. Fixes: `f211268ed1` ("dm: Use the block layer zone append emulation") Signed-off-by: Damien Le Moal <dlemoal@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Benjamin Marzinski <bmarzins@redhat.com> Reviewed-by: Niklas Cassel <cassel@kernel.org> Link: https://lore.kernel.org/r/20240611023639.89277-5-dlemoal@kernel.org Signed-off-by: Jens Axboe <axboe@kernel.dk>	2024-06-15 20:42:20 -06:00
Damien Le Moal	73a74af0c7	dm: Improve zone resource limits handling The generic stacking of limits implemented in the block layer cannot correctly handle stacking of zone resource limits (max open zones and max active zones) because these limits are for an entire device but the stacking may be for a portion of that device (e.g. a dm-linear target that does not cover an entire block device). As a result, when DM devices are created on top of zoned block devices, the DM device never has any zone resource limits advertized, which is only correct if all underlying target devices also have no zone resource limits. If at least one target device has resource limits, the user may see either performance issues (if the max open zone limit of the device is exceeded) or write I/O errors if the max active zone limit of one of the underlying target devices is exceeded. While it is very difficult to correctly and reliably stack zone resource limits in general, cases where targets are not sharing zone resources of the same device can be dealt with relatively easily. Such situation happens when a target maps all sequential zones of a zoned block device: for such mapping, other targets mapping other parts of the same zoned block device can only contain conventional zones and thus will not require any zone resource to correctly handle write operations. For a mapped device constructed with such targets, which includes mapped devices constructed with targets mapping entire zoned block devices, the zone resource limits can be reliably determined using the non-zero minimum of the zone resource limits of all targets. For mapped devices that include targets partially mapping the set of sequential write required zones of zoned block devices, instead of advertizing no zone resource limits, it is also better to set the mapped device limits to the non-zero minimum of the limits of all targets. In this case the limits for a target depend on the number of sequential zones being mapped: if this number of zone is larger than the limits, then the limits of the device apply and can be used. If on the other hand the target maps a number of zones smaller than the limits, then no limits is needed and we can assume that the target has no limits (limits set to 0). This commit improves zone resource limits handling as described above by modifying dm_set_zones_restrictions() to iterate the targets of a mapped device to evaluate the max open and max active zone limits. This relies on an internal "stacking" of the limits of the target devices combined with a direct counting of the number of sequential zones mapped by the targets. 1) For a target mapping an entire zoned block device, the limits for the target are set to the limits of the device. 2) For a target partially mapping a zoned block device, the number of mapped sequential zones is used to determine the limits: if the target maps more sequential write required zones than the device limits, then the limits of the device are used as-is. If the number of mapped sequential zones is lower than the limits, then we assume that the target has no limits (limits set to 0). As this evaluation is done for each target, the zone resource limits for the mapped device are evaluated as the non-zero minimum of the limits of all the targets. For configurations resulting in unreliable limits, i.e. a table containing a target partially mapping a zoned device, a warning message is issued. The counting of mapped sequential zones for the target is done using the new function dm_device_count_zones() which performs a report zones on the entire block device with the callback dm_device_count_zones_cb(). This count of mapped sequential zones is also used to determine if the mapped device contains only conventional zones. This allows simplifying dm_set_zones_restrictions() to not do a report zones just for this. For mapped devices mapping only conventional zones, as before, the mapped device is changed to a regular device by setting its zoned limit to false and clearing all its zone related limits. Signed-off-by: Damien Le Moal <dlemoal@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Benjamin Marzinski <bmarzins@redhat.com> Reviewed-by: Niklas Cassel <cassel@kernel.org> Link: https://lore.kernel.org/r/20240611023639.89277-4-dlemoal@kernel.org Signed-off-by: Jens Axboe <axboe@kernel.dk>	2024-06-15 20:42:20 -06:00
Damien Le Moal	7f91ccd8a6	dm: Call dm_revalidate_zones() after setting the queue limits dm_revalidate_zones() is called from dm_set_zone_restrictions() when the mapped device queue limits are not yet set. However, dm_revalidate_zones() calls blk_revalidate_disk_zones() and this function consults and modifies the mapped device queue limits. Thus, currently, blk_revalidate_disk_zones() operates on limits that are not yet initialized. Fix this by moving the call to dm_revalidate_zones() out of dm_set_zone_restrictions() and into dm_table_set_restrictions() after executing queue_limits_set(). To further cleanup dm_set_zones_restrictions(), the message about the type of zone append (native or emulated) is also moved inside dm_revalidate_zones(). Fixes: `1c0e720228` ("dm: use queue_limits_set") Signed-off-by: Damien Le Moal <dlemoal@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Benjamin Marzinski <bmarzins@redhat.com> Reviewed-by: Niklas Cassel <cassel@kernel.org> Link: https://lore.kernel.org/r/20240611023639.89277-3-dlemoal@kernel.org Signed-off-by: Jens Axboe <axboe@kernel.dk>	2024-06-15 20:42:20 -06:00
Jens Axboe	e3e72fe4cb	Merge branch 'for-6.11/block-limits' into for-6.11/block Pull in block limits branch, which exists as a shared branch for both the block and SCSI tree. * for-6.11/block-limits: (26 commits) block: move integrity information into queue_limits block: invert the BLK_INTEGRITY_{GENERATE,VERIFY} flags block: bypass the STABLE_WRITES flag for protection information block: don't require stable pages for non-PI metadata block: use kstrtoul in flag_store block: factor out flag_{store,show} helper for integrity block: remove the blk_flush_integrity call in blk_integrity_unregister block: remove the blk_integrity_profile structure dm-integrity: use the nop integrity profile md/raid1: don't free conf on raid0_run failure md/raid0: don't free conf on raid0_run failure block: initialize integrity buffer to zero before writing it to media block: add special APIs for run-time disabling of discard and friends block: remove unused queue limits API sr: convert to the atomic queue limits API sd: convert to the atomic queue limits API sd: cleanup zoned queue limits initialization sd: factor out a sd_discard_mode helper sd: simplify the disable case in sd_config_discard sd: add a sd_disable_write_same helper ...	2024-06-14 10:22:08 -06:00
Christoph Hellwig	c6e56cf6b2	block: move integrity information into queue_limits Move the integrity information into the queue limits so that it can be set atomically with other queue limits, and that the sysfs changes to the read_verify and write_generate flags are properly synchronized. This also allows to provide a more useful helper to stack the integrity fields, although it still is separate from the main stacking function as not all stackable devices want to inherit the integrity settings. Even with that it greatly simplifies the code in md and dm. Note that the integrity field is moved as-is into the queue limits. While there are good arguments for removing the separate blk_integrity structure, this would cause a lot of churn and might better be done at a later time if desired. However the integrity field in the queue_limits structure is now unconditional so that various ifdefs can be avoided or replaced with IS_ENABLED(). Given that tiny size of it that seems like a worthwhile trade off. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Hannes Reinecke <hare@suse.de> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Link: https://lore.kernel.org/r/20240613084839.1044015-13-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>	2024-06-14 10:20:07 -06:00
Christoph Hellwig	e9f5f44ad3	block: remove the blk_integrity_profile structure Block layer integrity configuration is a bit complex right now, as it indirects through operation vectors for a simple two-dimensional configuration: a) the checksum type of none, ip checksum, crc, crc64 b) the presence or absence of a reference tag Remove the integrity profile, and instead add a separate csum_type flag which replaces the existing ip-checksum field and a new flag that indicates the presence of the reference tag. This removes up to two layers of indirect calls, remove the need to offload the no-op verification of non-PI metadata to a workqueue and generally simplifies the code. The downside is that block/t10-pi.c now has to be built into the kernel when CONFIG_BLK_DEV_INTEGRITY is supported. Given that both nvme and SCSI require t10-pi.ko, it is loaded for all usual configurations that enabled CONFIG_BLK_DEV_INTEGRITY already, though. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Kanchan Joshi <joshi.k@samsung.com> Reviewed-by: Hannes Reinecke <hare@suse.de> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Link: https://lore.kernel.org/r/20240613084839.1044015-6-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>	2024-06-14 10:20:06 -06:00
Christoph Hellwig	63e649594a	dm-integrity: use the nop integrity profile Use the block layer built-in nop profile instead of duplicating it. Tested by: $ dd if=/dev/urandom of=key.bin bs=512 count=1 $ cryptsetup luksFormat -q --type luks2 --integrity hmac-sha256 \ --integrity-no-wipe /dev/nvme0n1 key.bin $ cryptsetup luksOpen /dev/nvme0n1 luks-integrity --key-file key.bin and then doing mkfs.xfs and simple I/O on the mount file system. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Milan Broz <gmazyland@gmail.com> Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com> Reviewed-by: Hannes Reinecke <hare@suse.de> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Link: https://lore.kernel.org/r/20240613084839.1044015-5-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>	2024-06-14 10:20:06 -06:00
Christoph Hellwig	799af947ed	md/raid1: don't free conf on raid0_run failure The core md code calls the ->free method which already frees conf. Fixes: `07f1a6850c` ("md/raid1: fail run raid1 array when active disk less than one") Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Link: https://lore.kernel.org/r/20240613084839.1044015-4-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>	2024-06-14 10:20:06 -06:00
Christoph Hellwig	d11854ed05	md/raid0: don't free conf on raid0_run failure The core md code calls the ->free method which already frees conf. Fixes: `0c031fd37f` ("md: Move alloc/free acct bioset in to personality") Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Reviewed-by: Yu Kuai <yukuai3@huawei.com> Link: https://lore.kernel.org/r/20240613084839.1044015-3-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>	2024-06-14 10:20:06 -06:00
Yu Kuai	305a5170dc	md/raid5: avoid BUG_ON() while continue reshape after reassembling Currently, mdadm support --revert-reshape to abort the reshape while reassembling, as the test 07revert-grow. However, following BUG_ON() can be triggerred by the test: kernel BUG at drivers/md/raid5.c:6278! invalid opcode: 0000 [#1] PREEMPT SMP PTI irq event stamp: 158985 CPU: 6 PID: 891 Comm: md0_reshape Not tainted 6.9.0-03335-g7592a0b0049a #94 RIP: 0010:reshape_request+0x3f1/0xe60 Call Trace: <TASK> raid5_sync_request+0x43d/0x550 md_do_sync+0xb7a/0x2110 md_thread+0x294/0x2b0 kthread+0x147/0x1c0 ret_from_fork+0x59/0x70 ret_from_fork_asm+0x1a/0x30 </TASK> Root cause is that --revert-reshape update the raid_disks from 5 to 4, while reshape position is still set, and after reassembling the array, reshape position will be read from super block, then during reshape the checking of 'writepos' that is caculated by old reshape position will fail. Fix this panic the easy way first, by converting the BUG_ON() to WARN_ON(), and stop the reshape if checkings fail. Noted that mdadm must fix --revert-shape as well, and probably md/raid should enhance metadata validation as well, however this means reassemble will fail and there must be user tools to fix the wrong metadata. Signed-off-by: Yu Kuai <yukuai3@huawei.com> Signed-off-by: Song Liu <song@kernel.org> Link: https://lore.kernel.org/r/20240611132251.1967786-13-yukuai1@huaweicloud.com	2024-06-12 16:32:57 +00:00
Yu Kuai	bc49694a9e	md: pass in max_sectors for pers->sync_request() For different sync_action, sync_thread will use different max_sectors, see details in md_sync_max_sectors(), currently both md_do_sync() and pers->sync_request() in eatch iteration have to get the same max_sectors. Hence pass in max_sectors for pers->sync_request() to prevent redundant code. Signed-off-by: Yu Kuai <yukuai3@huawei.com> Signed-off-by: Song Liu <song@kernel.org> Link: https://lore.kernel.org/r/20240611132251.1967786-12-yukuai1@huaweicloud.com	2024-06-12 16:32:57 +00:00
Yu Kuai	bbf2076277	md: factor out helpers for different sync_action in md_do_sync() Make code cleaner by replacing if else if with switch, and it's more obvious now what is doing for each sync_action. There are no functional changes. Signed-off-by: Yu Kuai <yukuai3@huawei.com> Signed-off-by: Song Liu <song@kernel.org> Link: https://lore.kernel.org/r/20240611132251.1967786-11-yukuai1@huaweicloud.com	2024-06-12 16:32:57 +00:00
Yu Kuai	d249e54188	md: replace last_sync_action with new enum type The only difference is that "none" is removed and initial last_sync_action will be idle. On the one hand, this value is introduced by commit `c4a3955145` ("MD: Remember the last sync operation that was performed"), and the usage described in commit message is not affected. On the other hand, last_sync_action is not used in mdadm or mdmon, and none of the tests that I can find. Signed-off-by: Yu Kuai <yukuai3@huawei.com> Signed-off-by: Song Liu <song@kernel.org> Link: https://lore.kernel.org/r/20240611132251.1967786-10-yukuai1@huaweicloud.com	2024-06-12 16:32:57 +00:00
Yu Kuai	7d9f107a4e	md: use new helpers in md_do_sync() Make code cleaner. and also use the action_name directly in kernel log: - "check" instead of "data-check" - "repair" instead of "requested-resync" Signed-off-by: Yu Kuai <yukuai3@huawei.com> Signed-off-by: Song Liu <song@kernel.org> Link: https://lore.kernel.org/r/20240611132251.1967786-9-yukuai1@huaweicloud.com	2024-06-12 16:32:37 +00:00
Yu Kuai	5ce10a3859	md: don't fail action_store() if sync_thread is not registered MD_RECOVERY_RUNNING will always be set when trying to register a new sync_thread, however, if md_start_sync() turns out to do nothing, MD_RECOVERY_RUNNING will be cleared in this case. And during the race window, action_store() will return -EBUSY, which will cause some mdadm tests to fail. For example: The test 07reshape5intr will add a new disk to array, then start reshape: mdadm /dev/md0 --add /dev/xxx mdadm --grow /dev/md0 -n 3 And add_bound_rdev() from mdadm --add will set MD_RECOVERY_NEEDED, then during the race windown, mdadm --grow will fail. Fix the problem by waiting in action_store() during the race window, fail only if sync_thread is registered. Signed-off-by: Yu Kuai <yukuai3@huawei.com> Signed-off-by: Song Liu <song@kernel.org> Link: https://lore.kernel.org/r/20240611132251.1967786-8-yukuai1@huaweicloud.com	2024-06-12 16:27:50 +00:00
Yu Kuai	df79234bdc	md: remove parameter check_seq for stop_sync_thread() Caller will always set MD_RECOVERY_FROZEN if check_seq is true, and always clear MD_RECOVERY_FROZEN if check_seq is false, hence replace the parameter with test_bit() to make code cleaner. Signed-off-by: Yu Kuai <yukuai3@huawei.com> Signed-off-by: Song Liu <song@kernel.org> Link: https://lore.kernel.org/r/20240611132251.1967786-7-yukuai1@huaweicloud.com	2024-06-12 16:27:50 +00:00
Yu Kuai	c8ecfe680c	md: replace sysfs api sync_action with new helpers To get rid of extrem long if else if usage, and make code cleaner. Signed-off-by: Yu Kuai <yukuai3@huawei.com> Signed-off-by: Song Liu <song@kernel.org> Link: https://lore.kernel.org/r/20240611132251.1967786-6-yukuai1@huaweicloud.com	2024-06-12 16:27:49 +00:00
Yu Kuai	207c5656c3	md: factor out helper to start reshape from action_store() There are no functional changes, just to make code cleaner and prepare for following refactor. Signed-off-by: Yu Kuai <yukuai3@huawei.com> Signed-off-by: Song Liu <song@kernel.org> Link: https://lore.kernel.org/r/20240611132251.1967786-5-yukuai1@huaweicloud.com	2024-06-12 16:27:49 +00:00
Yu Kuai	e792a4c215	md: add new helpers for sync_action The new helpers will get current sync_action of the array, will be used in later patches to make code cleaner. Signed-off-by: Yu Kuai <yukuai3@huawei.com> Signed-off-by: Song Liu <song@kernel.org> Link: https://lore.kernel.org/r/20240611132251.1967786-4-yukuai1@huaweicloud.com	2024-06-12 16:27:49 +00:00
Yu Kuai	a85aa09da2	md: add a new enum type sync_action In order to make code related to sync_thread cleaner in following patches, also add detail comment about each sync action. And also prepare to remove the related recovery_flags in the fulture. Signed-off-by: Yu Kuai <yukuai3@huawei.com> Signed-off-by: Song Liu <song@kernel.org> Link: https://lore.kernel.org/r/20240611132251.1967786-3-yukuai1@huaweicloud.com	2024-06-12 16:27:49 +00:00
Yu Kuai	0476d09c36	md: rearrange recovery_flags Currently there are lots of flags with the same confusing prefix "MD_REOCVERY_", and there are two main types of flags, sync thread runnng status, I prefer prefix "SYNC_THREAD_", and sync thread action, I perfer prefix "SYNC_ACTION_". For now, rearrange and update comment to improve code readability, there are no functional changes. Signed-off-by: Yu Kuai <yukuai3@huawei.com> Signed-off-by: Song Liu <song@kernel.org> Link: https://lore.kernel.org/r/20240611132251.1967786-2-yukuai1@huaweicloud.com	2024-06-12 16:27:49 +00:00
Ofir Gal	ab99a87542	md/md-bitmap: fix writing non bitmap pages __write_sb_page() rounds up the io size to the optimal io size if it doesn't exceed the data offset, but it doesn't check the final size exceeds the bitmap length. For example: page count - 1 page size - 4K data offset - 1M optimal io size - 256K The final io size would be 256K (64 pages) but md_bitmap_storage_alloc() allocated 1 page, the IO would write 1 valid page and 63 pages that happens to be allocated afterwards. This leaks memory to the raid device superblock. This issue caused a data transfer failure in nvme-tcp. The network drivers checks the first page of an IO with sendpage_ok(), it returns true if the page isn't a slabpage and refcount >= 1. If the page !sendpage_ok() the network driver disables MSG_SPLICE_PAGES. As of now the network layer assumes all the pages of the IO are sendpage_ok() when MSG_SPLICE_PAGES is on. The bitmap pages aren't slab pages, the first page of the IO is sendpage_ok(), but the additional pages that happens to be allocated after the bitmap pages might be !sendpage_ok(). That cause skb_splice_from_iter() to stop the data transfer, in the case below it hangs 'mdadm --create'. The bug is reproducible, in order to reproduce we need nvme-over-tcp controllers with optimal IO size bigger than PAGE_SIZE. Creating a raid with bitmap over those devices reproduces the bug. In order to simulate large optimal IO size you can use dm-stripe with a single device. Script to reproduce the issue on top of brd devices using dm-stripe is attached below (will be added to blktest). I have added some logs to test the theory: ... md: created bitmap (1 pages) for device md127 __write_sb_page before md_super_write offset: 16, size: 262144. pfn: 0x53ee === __write_sb_page before md_super_write. logging pages === pfn: 0x53ee, slab: 0 <-- the only page that allocated for the bitmap pfn: 0x53ef, slab: 1 pfn: 0x53f0, slab: 0 pfn: 0x53f1, slab: 0 pfn: 0x53f2, slab: 0 pfn: 0x53f3, slab: 1 ... nvme_tcp: sendpage_ok - pfn: 0x53ee, len: 262144, offset: 0 skbuff: before sendpage_ok() - pfn: 0x53ee skbuff: before sendpage_ok() - pfn: 0x53ef WARNING at net/core/skbuff.c:6848 skb_splice_from_iter+0x142/0x450 skbuff: !sendpage_ok - pfn: 0x53ef. is_slab: 1, page_count: 1 ... Cc: stable@vger.kernel.org Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Ofir Gal <ofir.gal@volumez.com> Signed-off-by: Song Liu <song@kernel.org> Link: https://lore.kernel.org/r/20240607072748.3182199-1-ofir.gal@volumez.com	2024-06-11 21:22:21 +00:00
Alexander Aring	5ce02000eb	md-cluster: use DLM_LSFL_SOFTIRQ for dlm_new_lockspace() Use the recently added DLM_LSFL_SOFTIRQ flag in dlm_new_lockspace(), signalling the ability to handle callbacks being run from softirq context. The md-cluster callback functions only call complete(), which is suitable for softirq. This should make dlm lock request completions more efficient by avoiding the workqueue context switch. Acked-by: Heming Zhao <heming.zhao@suse.com> Acked-by: Song Liu <song@kernel.org> Signed-off-by: Alexander Aring <aahringo@redhat.com> Signed-off-by: David Teigland <teigland@redhat.com>	2024-06-11 13:30:37 -05:00
Christoph Hellwig	17f91ac084	md/raid1: don't free conf on raid0_run failure The core md code calls the ->free method which already frees conf. Fixes: `07f1a6850c` ("md/raid1: fail run raid1 array when active disk less than one") Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Song Liu <song@kernel.org> Link: https://lore.kernel.org/r/20240604172607.3185916-3-hch@lst.de	2024-06-10 19:18:55 +00:00
Christoph Hellwig	35f20acaa3	md/raid0: don't free conf on raid0_run failure The core md code calls the ->free method which already frees conf. Fixes: `0c031fd37f` ("md: Move alloc/free acct bioset in to personality") Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Yu Kuai <yukuai3@huawei.com> Signed-off-by: Song Liu <song@kernel.org> Link: https://lore.kernel.org/r/20240604172607.3185916-2-hch@lst.de	2024-06-10 19:18:55 +00:00
Li Nan	acc6680af2	md: make md_flush_request() more readable Setting bio to NULL and checking 'if(!bio)' is redundant and looks strange, just consolidate them into one condition. There are no functional changes. Suggested-by: Christoph Hellwig <hch@infradead.org> Signed-off-by: Li Nan <linan122@huawei.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Song Liu <song@kernel.org> Link: https://lore.kernel.org/r/20240528203149.2383260-1-linan666@huaweicloud.com	2024-06-10 19:15:44 +00:00
Li Nan	611d5cbc0b	md: fix deadlock between mddev_suspend and flush bio Deadlock occurs when mddev is being suspended while some flush bio is in progress. It is a complex issue. T1. the first flush is at the ending stage, it clears 'mddev->flush_bio' and tries to submit data, but is blocked because mddev is suspended by T4. T2. the second flush sets 'mddev->flush_bio', and attempts to queue md_submit_flush_data(), which is already running (T1) and won't execute again if on the same CPU as T1. T3. the third flush inc active_io and tries to flush, but is blocked because 'mddev->flush_bio' is not NULL (set by T2). T4. mddev_suspend() is called and waits for active_io dec to 0 which is inc by T3. T1 T2 T3 T4 (flush 1) (flush 2) (third 3) (suspend) md_submit_flush_data mddev->flush_bio = NULL; . . md_flush_request . mddev->flush_bio = bio . queue submit_flushes . . . . md_handle_request . . active_io + 1 . . md_flush_request . . wait !mddev->flush_bio . . . . mddev_suspend . . wait !active_io . . . submit_flushes . queue_work md_submit_flush_data . //md_submit_flush_data is already running (T1) . md_handle_request wait resume The root issue is non-atomic inc/dec of active_io during flush process. active_io is dec before md_submit_flush_data is queued, and inc soon after md_submit_flush_data() run. md_flush_request active_io + 1 submit_flushes active_io - 1 md_submit_flush_data md_handle_request active_io + 1 make_request active_io - 1 If active_io is dec after md_handle_request() instead of within submit_flushes(), make_request() can be called directly intead of md_handle_request() in md_submit_flush_data(), and active_io will only inc and dec once in the whole flush process. Deadlock will be fixed. Additionally, the only difference between fixing the issue and before is that there is no return error handling of make_request(). But after previous patch cleaned md_write_start(), make_requst() only return error in raid5_make_request() by dm-raid, see commit `41425f96d7` ("dm-raid456, md/raid456: fix a deadlock for dm-raid456 while io concurrent with reshape)". Since dm always splits data and flush operation into two separate io, io size of flush submitted by dm always is 0, make_request() will not be called in md_submit_flush_data(). To prevent future modifications from introducing issues, add WARN_ON to ensure make_request() no error is returned in this context. Fixes: `fa2bbff7b0` ("md: synchronize flush io with array reconfiguration") Signed-off-by: Li Nan <linan122@huawei.com> Signed-off-by: Song Liu <song@kernel.org> Link: https://lore.kernel.org/r/20240525185257.3896201-3-linan666@huaweicloud.com	2024-06-10 19:10:25 +00:00
Li Nan	03e792eaf1	md: change the return value type of md_write_start to void Commit `cc27b0c78c` ("md: fix deadlock between mddev_suspend() and md_write_start()") aborted md_write_start() with false when mddev is suspended, which fixed a deadlock if calling mddev_suspend() with holding reconfig_mutex(). Since mddev_suspend() now includes lockdep_assert_not_held(), it no longer holds the reconfig_mutex. This makes previous abort unnecessary. Now, remove unnecessary abort and change function return value to void. Signed-off-by: Li Nan <linan122@huawei.com> Reviewed-by: Yu Kuai <yukuai3@huawei.com> Signed-off-by: Song Liu <song@kernel.org> Link: https://lore.kernel.org/r/20240525185257.3896201-2-linan666@huaweicloud.com	2024-06-10 19:10:25 +00:00
Li Nan	a8768a1345	md: do not delete safemode_timer in mddev_suspend The deletion of safemode_timer in mddev_suspend() is redundant and potentially harmful now. If timer is about to be woken up but gets deleted, 'in_sync' will remain 0 until the next write, causing array to stay in the 'active' state instead of transitioning to 'clean'. Commit `0d9f4f135e` ("MD: Add del_timer_sync to mddev_suspend (fix nasty panic))" introduced this deletion for dm, because if timer fired after dm is destroyed, the resource which the timer depends on might have been freed. However, commit `0dd84b3193` ("md: call __md_stop_writes in md_stop") added __md_stop_writes() to md_stop(), which is called before freeing resource. Timer is deleted in __md_stop_writes(), and the origin issue is resolved. Therefore, delete safemode_timer can be removed safely now. Signed-off-by: Li Nan <linan122@huawei.com> Reviewed-by: Yu Kuai <yukuai3@huawei.com> Signed-off-by: Song Liu <song@kernel.org> Link: https://lore.kernel.org/r/20240508092053.1447930-1-linan666@huaweicloud.com	2024-06-10 16:26:28 +00:00
Coly Li	74d4ce92e0	bcache: code cleanup in __bch_bucket_alloc_set() In __bch_bucket_alloc_set() the lines after lable 'err:' indeed do nothing useful after multiple cache devices are removed from bcache code. This cleanup patch drops the useless code to save a bit CPU cycles. Signed-off-by: Coly Li <colyli@suse.de> Link: https://lore.kernel.org/r/20240528120914.28705-4-colyli@suse.de Signed-off-by: Jens Axboe <axboe@kernel.dk>	2024-05-28 06:55:59 -06:00
Coly Li	05356938a4	bcache: call force_wake_up_gc() if necessary in check_should_bypass() If there are extreme heavy write I/O continuously hit on relative small cache device (512GB in my testing), it is possible to make counter c->gc_stats.in_use continue to increase and exceed CUTOFF_CACHE_ADD. If 'c->gc_stats.in_use > CUTOFF_CACHE_ADD' happens, all following write requests will bypass the cache device because check_should_bypass() returns 'true'. Because all writes bypass the cache device, counter c->sectors_to_gc has no chance to be negative value, and garbage collection thread won't be waken up even the whole cache becomes clean after writeback accomplished. The aftermath is that all write I/Os go directly into backing device even the cache device is clean. To avoid the above situation, this patch uses a quite conservative way to fix: if 'c->gc_stats.in_use > CUTOFF_CACHE_ADD' happens, only wakes up garbage collection thread when the whole cache device is clean. Before the fix, the writes-always-bypass situation happens after 10+ hours write I/O pressure on 512GB Intel optane memory which acts as cache device. After this fix, such situation doesn't happen after 36+ hours testing. Signed-off-by: Coly Li <colyli@suse.de> Link: https://lore.kernel.org/r/20240528120914.28705-3-colyli@suse.de Signed-off-by: Jens Axboe <axboe@kernel.dk>	2024-05-28 06:55:59 -06:00
Dongsheng Yang	a14a68b769	bcache: allow allocator to invalidate bucket in gc Currently, if the gc is running, when the allocator found free_inc is empty, allocator has to wait the gc finish. Before that, the IO is blocked. But actually, there would be some buckets is reclaimable before gc, and gc will never mark this kind of bucket to be unreclaimable. So we can put these buckets into free_inc in gc running to avoid IO being blocked. Signed-off-by: Dongsheng Yang <dongsheng.yang@easystack.cn> Signed-off-by: Mingzhe Zou <mingzhe.zou@easystack.cn> Signed-off-by: Coly Li <colyli@suse.de> Link: https://lore.kernel.org/r/20240528120914.28705-2-colyli@suse.de Signed-off-by: Jens Axboe <axboe@kernel.dk>	2024-05-28 06:55:59 -06:00
Christoph Hellwig	c8c1f7012b	dm: make dm_set_zones_restrictions work on the queue limits Don't stuff the values directly into the queue without any synchronization, but instead delay applying the queue limits in the caller and let dm_set_zones_restrictions work on the limit structure. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Link: https://lore.kernel.org/r/20240527123634.1116952-4-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>	2024-05-27 09:16:39 -06:00
Christoph Hellwig	5e7a4bbcc3	dm: remove dm_check_zoned Fold it into the only caller in preparation to changes in the queue limits setup. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Mike Snitzer <snitzer@kernel.org> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Link: https://lore.kernel.org/r/20240527123634.1116952-3-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>	2024-05-27 09:16:39 -06:00
Christoph Hellwig	d9780064b1	dm: move setting zoned_enabled to dm_table_set_restrictions Keep it together with the rest of the zoned code. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Link: https://lore.kernel.org/r/20240527123634.1116952-2-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>	2024-05-27 09:16:39 -06:00
Linus Torvalds	856726396d	- Fix DM discard regressions due to DM core switching over to using queue_limits_set() without DM core and targets first being updated to set (and stack) discard limits in terms of max_hw_discard_sectors and not max_discard_sectors. - Fix stable@ DM integrity discard support to set device's discard_granularity limit to the device's logical block size. -----BEGIN PGP SIGNATURE----- iQEzBAABCAAdFiEEJfWUX4UqZ4x1O2wixSPxCi2dA1oFAmZMv3kACgkQxSPxCi2d A1pX2gf+NjaTP+6otMJ44sYwUGHuWtgwDy7NcoTiF4RvLQjjrUh8Lgm2p3i4evFs 4AzMNXy5V6/mEx/AZb7sjCtPkIGOykkBud3+jhStohQKvXEJQcYtwACNv151NW2c Fw2tcPZbPKH8P8UkDkegLaCu+4DotYjhuw44dfHFVZ95Wlhcm2UmcjWQEf/LtYW0 Si2FE0z2n1yi1mY6cExuL0bJ+LaMrXRQzHE3ZPaPRn4PFNUTY1juKsHYbgyL7cZ+ xQM1W/KBrKzDztCJGKH4Cl86L3kNPRkiQ7BQ/2Wtse20o10EGT0+lPvevCNcPNpi gbjAo7OPy9WPA9N/AR8Wfj06YQFaGA== =Pp+x -----END PGP SIGNATURE----- Merge tag 'for-6.10/dm-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm Pull device mapper fixes from Mike Snitzer: - Fix DM discard regressions due to DM core switching over to using queue_limits_set() without DM core and targets first being updated to set (and stack) discard limits in terms of max_hw_discard_sectors and not max_discard_sectors - Fix stable@ DM integrity discard support to set device's discard_granularity limit to the device's logical block size * tag 'for-6.10/dm-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm: dm: always manage discard support in terms of max_hw_discard_sectors dm-integrity: set discard_granularity to logical block size	2024-05-21 11:43:11 -07:00
Linus Torvalds	38da32ee70	bd_inode series Replacement of bdev->bd_inode with sane(r) set of primitives. -----BEGIN PGP SIGNATURE----- iHUEABYIAB0WIQQqUNBr3gm4hGXdBJlZ7Krx/gZQ6wUCZkwjlgAKCRBZ7Krx/gZQ 66OmAP9nhZLASn/iM2+979I6O0GW+vid+uLh48uW3d+LbsmVIgD9GYpR+cuLQ/xj mJESWfYKOVSpFFSrqlzKg9PQlU/GFgs= =6LRp -----END PGP SIGNATURE----- Merge tag 'pull-bd_inode-1' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs Pull bdev bd_inode updates from Al Viro: "Replacement of bdev->bd_inode with sane(r) set of primitives by me and Yu Kuai" * tag 'pull-bd_inode-1' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: RIP ->bd_inode dasd_format(): killing the last remaining user of ->bd_inode nilfs_attach_log_writer(): use ->bd_mapping->host instead of ->bd_inode block/bdev.c: use the knowledge of inode/bdev coallocation gfs2: more obvious initializations of mapping->host fs/buffer.c: massage the remaining users of ->bd_inode to ->bd_mapping blk_ioctl_{discard,zeroout}(): we only want ->bd_inode->i_mapping here... grow_dev_folio(): we only want ->bd_inode->i_mapping there use ->bd_mapping instead of ->bd_inode->i_mapping block_device: add a pointer to struct address_space (page cache of bdev) missing helpers: bdev_unhash(), bdev_drop() block: move two helpers into bdev.c block2mtd: prevent direct access of bd_inode dm-vdo: use bdev_nr_bytes(bdev) instead of i_size_read(bdev->bd_inode) blkdev_write_iter(): saner way to get inode and bdev bcachefs: remove dead function bdev_sectors() ext4: remove block_device_ejected() erofs_buf: store address_space instead of inode erofs: switch erofs_bread() to passing offset instead of block number	2024-05-21 09:51:42 -07:00
Linus Torvalds	5ad8b6ad9a	getting rid of bogus set_blocksize() uses, switching it to struct file * and verifying that caller has device opened exclusively. -----BEGIN PGP SIGNATURE----- iHUEABYIAB0WIQQqUNBr3gm4hGXdBJlZ7Krx/gZQ6wUCZkwkfQAKCRBZ7Krx/gZQ 62C3AQDW5vuXNx2+KDPma5YStjFpPLC0xtSyAS5D3YANjtyRFgD/TOcCarq7rvBt KubxHVFsfW+eu6ASeaoMRB83w5OIzwk= =Liix -----END PGP SIGNATURE----- Merge tag 'pull-set_blocksize' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs Pull vfs blocksize updates from Al Viro: "This gets rid of bogus set_blocksize() uses, switches it over to be based on a 'struct file ' and verifies that the caller has the device opened exclusively" tag 'pull-set_blocksize' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: make set_blocksize() fail unless block device is opened exclusive set_blocksize(): switch to passing struct file * btrfs_get_bdev_and_sb(): call set_blocksize() only for exclusive opens swsusp: don't bother with setting block size zram: don't bother with reopening - just use O_EXCL for open swapon(2): open swap with O_EXCL swapon(2)/swapoff(2): don't bother with block size pktcdvd: sort set_blocksize() calls out bcache_register(): don't bother with set_blocksize()	2024-05-21 08:34:51 -07:00
Mike Snitzer	825d8bbd2f	dm: always manage discard support in terms of max_hw_discard_sectors Commit `4f563a6473` ("block: add a max_user_discard_sectors queue limit") changed block core to set max_discard_sectors to: min(lim->max_hw_discard_sectors, lim->max_user_discard_sectors) Since commit `1c0e720228` ("dm: use queue_limits_set") it was reported dm-thinp was failing in a few fstests (generic/347 and generic/405) with the first WARN_ON_ONCE in dm_cell_key_has_valid_range() being reported, e.g.: WARNING: CPU: 1 PID: 30 at drivers/md/dm-bio-prison-v1.c:128 dm_cell_key_has_valid_range+0x3d/0x50 blk_set_stacking_limits() sets max_user_discard_sectors to UINT_MAX, so given how block core now sets max_discard_sectors (detailed above) it follows that blk_stack_limits() stacks up the underlying device's max_hw_discard_sectors and max_discard_sectors is set to match it. If max_hw_discard_sectors exceeds dm's BIO_PRISON_MAX_RANGE, then dm_cell_key_has_valid_range() will trigger the warning with: WARN_ON_ONCE(key->block_end - key->block_begin > BIO_PRISON_MAX_RANGE) Aside from this warning, the discard will fail. Fix this and other DM issues by governing discard support in terms of max_hw_discard_sectors instead of max_discard_sectors. Reported-by: Theodore Ts'o <tytso@mit.edu> Fixes: `1c0e720228` ("dm: use queue_limits_set") Signed-off-by: Mike Snitzer <snitzer@kernel.org>	2024-05-20 15:51:19 -04:00
Mikulas Patocka	69381cf88a	dm-integrity: set discard_granularity to logical block size dm-integrity could set discard_granularity lower than the logical block size. This could result in failures when sending discard requests to dm-integrity. This fix is needed for kernels prior to 6.10. Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Reported-by: Eric Wheeler <linux-integrity@lists.ewheeler.net> Cc: stable@vger.kernel.org # <= 6.9 Signed-off-by: Mike Snitzer <snitzer@kernel.org>	2024-05-20 15:49:22 -04:00
Linus Torvalds	ff9a79307f	Kbuild updates for v6.10 - Avoid 'constexpr', which is a keyword in C23 - Allow 'dtbs_check' and 'dt_compatible_check' run independently of 'dt_binding_check' - Fix weak references to avoid GOT entries in position-independent code generation - Convert the last use of 'optional' property in arch/sh/Kconfig - Remove support for the 'optional' property in Kconfig - Remove support for Clang's ThinLTO caching, which does not work with the .incbin directive - Change the semantics of $(src) so it always points to the source directory, which fixes Makefile inconsistencies between upstream and downstream - Fix 'make tar-pkg' for RISC-V to produce a consistent package - Provide reasonable default coverage for objtool, sanitizers, and profilers - Remove redundant OBJECT_FILES_NON_STANDARD, KASAN_SANITIZE, etc. - Remove the last use of tristate choice in drivers/rapidio/Kconfig - Various cleanups and fixes in Kconfig -----BEGIN PGP SIGNATURE----- iQJJBAABCgAzFiEEbmPs18K1szRHjPqEPYsBB53g2wYFAmZFlGcVHG1hc2FoaXJv eUBrZXJuZWwub3JnAAoJED2LAQed4NsG8voQALC8NtFpduWVfLRj2Qg6Ll/xf1vX 2igcTJEOFHkeqXLGoT8dTDKLEipUBUvKyguPq66CGwVTe2g6zy/nUSXeVtFrUsIa msLTi8FqhqUo5lodNvGMRf8qqmuqcvnXoiQwIocF92jtsFy14bhiFY+n4HfcFNjj GOKwqBZYQUwY/VVb090efc7RfS9c7uwABJSBelSoxg3AGZriwjGy7Pw5aSKGgVYi inqL1eR6qwPP6z7CgQWM99soP+zwybFZmnQrsD9SniRBI4rtAat8Ih5jQFaSUFUQ lk2w0NQBRFN88/uR2IJ2GWuIlQ74WeJ+QnCqVuQ59tV5zw90wqSmLzngfPD057Dv JjNuhk0UyXVtpIg3lRtd4810ppNSTe33b9OM4O2H846W/crju5oDRNDHcflUXcwm Rmn5ho1rb5QVzDVejJbgwidnUInSgJ9PZcvXQ/RJVZPhpgsBzAY9pQexG1G3hviw y9UDrt6KP6bF9tHjmolmtdIes9Pj0c4dN6/Rdj4HS4hIQ/GDar0tnwvOvtfUctNL orJlBsA6GeMmDVXKkR0ytOCWRYqWWbyt8g70RVKQJfuHX7/hGyAQPaQ2/u4mQhC2 aevYfbNJMj0VDfGz81HDBKFtkc5n+Ite8l157dHEl2LEabkOkRdNVcn7SNbOvZmd ZCSnZ31h7woGfNho =D5B/ -----END PGP SIGNATURE----- Merge tag 'kbuild-v6.10' of git://git.kernel.org/pub/scm/linux/kernel/git/masahiroy/linux-kbuild Pull Kbuild updates from Masahiro Yamada: - Avoid 'constexpr', which is a keyword in C23 - Allow 'dtbs_check' and 'dt_compatible_check' run independently of 'dt_binding_check' - Fix weak references to avoid GOT entries in position-independent code generation - Convert the last use of 'optional' property in arch/sh/Kconfig - Remove support for the 'optional' property in Kconfig - Remove support for Clang's ThinLTO caching, which does not work with the .incbin directive - Change the semantics of $(src) so it always points to the source directory, which fixes Makefile inconsistencies between upstream and downstream - Fix 'make tar-pkg' for RISC-V to produce a consistent package - Provide reasonable default coverage for objtool, sanitizers, and profilers - Remove redundant OBJECT_FILES_NON_STANDARD, KASAN_SANITIZE, etc. - Remove the last use of tristate choice in drivers/rapidio/Kconfig - Various cleanups and fixes in Kconfig * tag 'kbuild-v6.10' of git://git.kernel.org/pub/scm/linux/kernel/git/masahiroy/linux-kbuild: (46 commits) kconfig: use sym_get_choice_menu() in sym_check_prop() rapidio: remove choice for enumeration kconfig: lxdialog: remove initialization with A_NORMAL kconfig: m/nconf: merge two item_add_str() calls kconfig: m/nconf: remove dead code to display value of bool choice kconfig: m/nconf: remove dead code to display children of choice members kconfig: gconf: show checkbox for choice correctly kbuild: use GCOV_PROFILE and KCSAN_SANITIZE in scripts/Makefile.modfinal Makefile: remove redundant tool coverage variables kbuild: provide reasonable defaults for tool coverage modules: Drop the .export_symbol section from the final modules kconfig: use menu_list_for_each_sym() in sym_check_choice_deps() kconfig: use sym_get_choice_menu() in conf_write_defconfig() kconfig: add sym_get_choice_menu() helper kconfig: turn defaults and additional prompt for choice members into error kconfig: turn missing prompt for choice members into error kconfig: turn conf_choice() into void function kconfig: use linked list in sym_set_changed() kconfig: gconf: use MENU_CHANGED instead of SYMBOL_CHANGED kconfig: gconf: remove debug code ...	2024-05-18 12:39:20 -07:00
Linus Torvalds	1b294a1f35	Networking changes for 6.10. Core & protocols ---------------- - Complete rework of garbage collection of AF_UNIX sockets. AF_UNIX is prone to forming reference count cycles due to fd passing functionality. New method based on Tarjan's Strongly Connected Components algorithm should be both faster and remove a lot of workarounds we accumulated over the years. - Add TCP fraglist GRO support, allowing chaining multiple TCP packets and forwarding them together. Useful for small switches / routers which lack basic checksum offload in some scenarios (e.g. PPPoE). - Support using SMP threads for handling packet backlog i.e. packet processing from software interfaces and old drivers which don't use NAPI. This helps move the processing out of the softirq jumble. - Continue work of converting from rtnl lock to RCU protection. Don't require rtnl lock when reading: IPv6 routing FIB, IPv6 address labels, netdev threaded NAPI sysfs files, bonding driver's sysfs files, MPLS devconf, IPv4 FIB rules, netns IDs, tcp metrics, TC Qdiscs, neighbor entries, ARP entries via ioctl(SIOCGARP), a lot of the link information available via rtnetlink. - Small optimizations from Eric to UDP wake up handling, memory accounting, RPS/RFS implementation, TCP packet sizing etc. - Allow direct page recycling in the bulk API used by XDP, for +2% PPS. - Support peek with an offset on TCP sockets. - Add MPTCP APIs for querying last time packets were received/sent/acked, and whether MPTCP "upgrade" succeeded on a TCP socket. - Add intra-node communication shortcut to improve SMC performance. - Add IPv6 (and IPv{4,6}-over-IPv{4,6}) support to the GTP protocol driver. - Add HSR-SAN (RedBOX) mode of operation to the HSR protocol driver. - Add reset reasons for tracing what caused a TCP reset to be sent. - Introduce direction attribute for xfrm (IPSec) states. State can be used either for input or output packet processing. Things we sprinkled into general kernel code -------------------------------------------- - Add bitmap_{read,write}(), bitmap_size(), expose BYTES_TO_BITS(). This required touch-ups and renaming of a few existing users. - Add Endian-dependent __counted_by_{le,be} annotations. - Make building selftests "quieter" by printing summaries like "CC object.o" rather than full commands with all the arguments. Netfilter --------- - Use GFP_KERNEL to clone elements, to deal better with OOM situations and avoid failures in the .commit step. BPF --- - Add eBPF JIT for ARCv2 CPUs. - Support attaching kprobe BPF programs through kprobe_multi link in a session mode, meaning, a BPF program is attached to both function entry and return, the entry program can decide if the return program gets executed and the entry program can share u64 cookie value with return program. "Session mode" is a common use-case for tetragon and bpftrace. - Add the ability to specify and retrieve BPF cookie for raw tracepoint programs in order to ease migration from classic to raw tracepoints. - Add an internal-only BPF per-CPU instruction for resolving per-CPU memory addresses and implement support in x86, ARM64 and RISC-V JITs. This allows inlining functions which need to access per-CPU state. - Optimize x86 BPF JIT's emit_mov_imm64, and add support for various atomics in bpf_arena which can be JITed as a single x86 instruction. Support BPF arena on ARM64. - Add a new bpf_wq API for deferring events and refactor process-context bpf_timer code to keep common code where possible. - Harden the BPF verifier's and/or/xor value tracking. - Introduce crypto kfuncs to let BPF programs call kernel crypto APIs. - Support bpf_tail_call_static() helper for BPF programs with GCC 13. - Add bpf_preempt_{disable,enable}() kfuncs in order to allow a BPF program to have code sections where preemption is disabled. Driver API ---------- - Skip software TC processing completely if all installed rules are marked as HW-only, instead of checking the HW-only flag rule by rule. - Add support for configuring PoE (Power over Ethernet), similar to the already existing support for PoDL (Power over Data Line) config. - Initial bits of a queue control API, for now allowing a single queue to be reset without disturbing packet flow to other queues. - Common (ethtool) statistics for hardware timestamping. Tests and tooling ----------------- - Remove the need to create a config file to run the net forwarding tests so that a naive "make run_tests" can exercise them. - Define a method of writing tests which require an external endpoint to communicate with (to send/receive data towards the test machine). Add a few such tests. - Create a shared code library for writing Python tests. Expose the YAML Netlink library from tools/ to the tests for easy Netlink access. - Move netfilter tests under net/, extend them, separate performance tests from correctness tests, and iron out issues found by running them "on every commit". - Refactor BPF selftests to use common network helpers. - Further work filling in YAML definitions of Netlink messages for: nftables, team driver, bonding interfaces, vlan interfaces, VF info, TC u32 mark, TC police action. - Teach Python YAML Netlink to decode attribute policies. - Extend the definition of the "indexed array" construct in the specs to cover arrays of scalars rather than just nests. - Add hyperlinks between definitions in generated Netlink docs. Drivers ------- - Make sure unsupported flower control flags are rejected by drivers, and make more drivers report errors directly to the application rather than dmesg (large number of driver changes from Asbjørn Sloth Tønnesen). - Ethernet high-speed NICs: - Broadcom (bnxt): - support multiple RSS contexts and steering traffic to them - support XDP metadata - make page pool allocations more NUMA aware - Intel (100G, ice, idpf): - extract datapath code common among Intel drivers into a library - use fewer resources in switchdev by sharing queues with the PF - add PFCP filter support - add Ethernet filter support - use a spinlock instead of HW lock in PTP clock ops - support 5 layer Tx scheduler topology - nVidia/Mellanox: - 800G link modes and 100G SerDes speeds - per-queue IRQ coalescing configuration - Marvell Octeon: - support offloading TC packet mark action - Ethernet NICs consumer, embedded and virtual: - stop lying about skb->truesize in USB Ethernet drivers, it messes up TCP memory calculations - Google cloud vNIC: - support changing ring size via ethtool - support ring reset using the queue control API - VirtIO net: - expose flow hash from RSS to XDP - per-queue statistics - add selftests - Synopsys (stmmac): - support controllers which require an RX clock signal from the MII bus to perform their hardware initialization - TI: - icssg_prueth: support ICSSG-based Ethernet on AM65x SR1.0 devices - icssg_prueth: add SW TX / RX Coalescing based on hrtimers - cpsw: minimal XDP support - Renesas (ravb): - support describing the MDIO bus - Realtek (r8169): - add support for RTL8168M - Microchip Sparx5: - matchall and flower actions mirred and redirect - Ethernet switches: - nVidia/Mellanox: - improve events processing performance - Marvell: - add support for MV88E6250 family internal PHYs - Microchip: - add DCB and DSCP mapping support for KSZ switches - vsc73xx: convert to PHYLINK - Realtek: - rtl8226b/rtl8221b: add C45 instances and SerDes switching - Many driver changes related to PHYLIB and PHYLINK deprecated API cleanup. - Ethernet PHYs: - Add a new driver for Airoha EN8811H 2.5 Gigabit PHY. - micrel: lan8814: add support for PPS out and external timestamp trigger - WiFi: - Disable Wireless Extensions (WEXT) in all Wi-Fi 7 devices drivers. Modern devices can only be configured using nl80211. - mac80211/cfg80211 - handle color change per link for WiFi 7 Multi-Link Operation - Intel (iwlwifi): - don't support puncturing in 5 GHz - support monitor mode on passive channels - BZ-W device support - P2P with HE/EHT support - re-add support for firmware API 90 - provide channel survey information for Automatic Channel Selection - MediaTek (mt76): - mt7921 LED control - mt7925 EHT radiotap support - mt7920e PCI support - Qualcomm (ath11k): - P2P support for QCA6390, WCN6855 and QCA2066 - support hibernation - ieee80211-freq-limit Device Tree property support - Qualcomm (ath12k): - refactoring in preparation of multi-link support - suspend and hibernation support - ACPI support - debugfs support, including dfs_simulate_radar support - RealTek: - rtw88: RTL8723CS SDIO device support - rtw89: RTL8922AE Wi-Fi 7 PCI device support - rtw89: complete features of new WiFi 7 chip 8922AE including BT-coexistence and Wake-on-WLAN - rtw89: use BIOS ACPI settings to set TX power and channels - rtl8xxxu: enable Management Frame Protection (MFP) support - Bluetooth: - support for Intel BlazarI and Filmore Peak2 (BE201) - support for MediaTek MT7921S SDIO - initial support for Intel PCIe BT driver - remove HCI_AMP support Signed-off-by: Jakub Kicinski <kuba@kernel.org> -----BEGIN PGP SIGNATURE----- iQIzBAABCAAdFiEE6jPA+I1ugmIBA4hXMUZtbf5SIrsFAmZD6sQACgkQMUZtbf5S IrtLYw/+I73ePGIye37o2jpbodcLAUZVfF3r6uYUzK8hokEcKD0QVJa9w7PizLZ3 UO45ClOXFLJCkfP4reFenLfxGCel2AJI+F7VFl2xaO2XgrcH/lnVrHqKZEAEXjls KoYMnShIolv7h2MKP6hHtyTi2j1wvQUKsZC71o9/fuW+4fUT8gECx1YtYcL73wrw gEMdlUgBYC3jiiCUHJIFX6iPJ2t/TC+q1eIIF2K/Osrk2kIqQhzoozcL4vpuAZQT 99ljx/qRelXa8oppDb7nM5eulg7WY8ZqxEfFZphTMC5nLEGzClxuOTTl2kDYI/D/ UZmTWZDY+F5F0xvNk2gH84qVJXBOVDoobpT7hVA/tDuybobc/kvGDzRayEVqVzKj Q0tPlJs+xBZpkK5TVnxaFLJVOM+p1Xosxy3kNVXmuYNBvT/R89UbJiCrUKqKZF+L z/1mOYUv8UklHqYAeuJSptHvqJjTGa/fsEYP7dAUBbc1N2eVB8mzZ4mgU5rYXbtC E6UXXiWnoSRm8bmco9QmcWWoXt5UGEizHSJLz6t1R5Df/YmXhWlytll5aCwY1ksf FNoL7S4u7AZThL1Nwi7yUs4CAjhk/N4aOsk+41S0sALCx30BJuI6UdesAxJ0lu+Z fwCQYbs27y4p7mBLbkYwcQNxAxGm7PSK4yeyRIy2njiyV4qnLf8= =EsC2 -----END PGP SIGNATURE----- Merge tag 'net-next-6.10' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next Pull networking updates from Jakub Kicinski: "Core & protocols: - Complete rework of garbage collection of AF_UNIX sockets. AF_UNIX is prone to forming reference count cycles due to fd passing functionality. New method based on Tarjan's Strongly Connected Components algorithm should be both faster and remove a lot of workarounds we accumulated over the years. - Add TCP fraglist GRO support, allowing chaining multiple TCP packets and forwarding them together. Useful for small switches / routers which lack basic checksum offload in some scenarios (e.g. PPPoE). - Support using SMP threads for handling packet backlog i.e. packet processing from software interfaces and old drivers which don't use NAPI. This helps move the processing out of the softirq jumble. - Continue work of converting from rtnl lock to RCU protection. Don't require rtnl lock when reading: IPv6 routing FIB, IPv6 address labels, netdev threaded NAPI sysfs files, bonding driver's sysfs files, MPLS devconf, IPv4 FIB rules, netns IDs, tcp metrics, TC Qdiscs, neighbor entries, ARP entries via ioctl(SIOCGARP), a lot of the link information available via rtnetlink. - Small optimizations from Eric to UDP wake up handling, memory accounting, RPS/RFS implementation, TCP packet sizing etc. - Allow direct page recycling in the bulk API used by XDP, for +2% PPS. - Support peek with an offset on TCP sockets. - Add MPTCP APIs for querying last time packets were received/sent/acked and whether MPTCP "upgrade" succeeded on a TCP socket. - Add intra-node communication shortcut to improve SMC performance. - Add IPv6 (and IPv{4,6}-over-IPv{4,6}) support to the GTP protocol driver. - Add HSR-SAN (RedBOX) mode of operation to the HSR protocol driver. - Add reset reasons for tracing what caused a TCP reset to be sent. - Introduce direction attribute for xfrm (IPSec) states. State can be used either for input or output packet processing. Things we sprinkled into general kernel code: - Add bitmap_{read,write}(), bitmap_size(), expose BYTES_TO_BITS(). This required touch-ups and renaming of a few existing users. - Add Endian-dependent __counted_by_{le,be} annotations. - Make building selftests "quieter" by printing summaries like "CC object.o" rather than full commands with all the arguments. Netfilter: - Use GFP_KERNEL to clone elements, to deal better with OOM situations and avoid failures in the .commit step. BPF: - Add eBPF JIT for ARCv2 CPUs. - Support attaching kprobe BPF programs through kprobe_multi link in a session mode, meaning, a BPF program is attached to both function entry and return, the entry program can decide if the return program gets executed and the entry program can share u64 cookie value with return program. "Session mode" is a common use-case for tetragon and bpftrace. - Add the ability to specify and retrieve BPF cookie for raw tracepoint programs in order to ease migration from classic to raw tracepoints. - Add an internal-only BPF per-CPU instruction for resolving per-CPU memory addresses and implement support in x86, ARM64 and RISC-V JITs. This allows inlining functions which need to access per-CPU state. - Optimize x86 BPF JIT's emit_mov_imm64, and add support for various atomics in bpf_arena which can be JITed as a single x86 instruction. Support BPF arena on ARM64. - Add a new bpf_wq API for deferring events and refactor process-context bpf_timer code to keep common code where possible. - Harden the BPF verifier's and/or/xor value tracking. - Introduce crypto kfuncs to let BPF programs call kernel crypto APIs. - Support bpf_tail_call_static() helper for BPF programs with GCC 13. - Add bpf_preempt_{disable,enable}() kfuncs in order to allow a BPF program to have code sections where preemption is disabled. Driver API: - Skip software TC processing completely if all installed rules are marked as HW-only, instead of checking the HW-only flag rule by rule. - Add support for configuring PoE (Power over Ethernet), similar to the already existing support for PoDL (Power over Data Line) config. - Initial bits of a queue control API, for now allowing a single queue to be reset without disturbing packet flow to other queues. - Common (ethtool) statistics for hardware timestamping. Tests and tooling: - Remove the need to create a config file to run the net forwarding tests so that a naive "make run_tests" can exercise them. - Define a method of writing tests which require an external endpoint to communicate with (to send/receive data towards the test machine). Add a few such tests. - Create a shared code library for writing Python tests. Expose the YAML Netlink library from tools/ to the tests for easy Netlink access. - Move netfilter tests under net/, extend them, separate performance tests from correctness tests, and iron out issues found by running them "on every commit". - Refactor BPF selftests to use common network helpers. - Further work filling in YAML definitions of Netlink messages for: nftables, team driver, bonding interfaces, vlan interfaces, VF info, TC u32 mark, TC police action. - Teach Python YAML Netlink to decode attribute policies. - Extend the definition of the "indexed array" construct in the specs to cover arrays of scalars rather than just nests. - Add hyperlinks between definitions in generated Netlink docs. Drivers: - Make sure unsupported flower control flags are rejected by drivers, and make more drivers report errors directly to the application rather than dmesg (large number of driver changes from Asbjørn Sloth Tønnesen). - Ethernet high-speed NICs: - Broadcom (bnxt): - support multiple RSS contexts and steering traffic to them - support XDP metadata - make page pool allocations more NUMA aware - Intel (100G, ice, idpf): - extract datapath code common among Intel drivers into a library - use fewer resources in switchdev by sharing queues with the PF - add PFCP filter support - add Ethernet filter support - use a spinlock instead of HW lock in PTP clock ops - support 5 layer Tx scheduler topology - nVidia/Mellanox: - 800G link modes and 100G SerDes speeds - per-queue IRQ coalescing configuration - Marvell Octeon: - support offloading TC packet mark action - Ethernet NICs consumer, embedded and virtual: - stop lying about skb->truesize in USB Ethernet drivers, it messes up TCP memory calculations - Google cloud vNIC: - support changing ring size via ethtool - support ring reset using the queue control API - VirtIO net: - expose flow hash from RSS to XDP - per-queue statistics - add selftests - Synopsys (stmmac): - support controllers which require an RX clock signal from the MII bus to perform their hardware initialization - TI: - icssg_prueth: support ICSSG-based Ethernet on AM65x SR1.0 devices - icssg_prueth: add SW TX / RX Coalescing based on hrtimers - cpsw: minimal XDP support - Renesas (ravb): - support describing the MDIO bus - Realtek (r8169): - add support for RTL8168M - Microchip Sparx5: - matchall and flower actions mirred and redirect - Ethernet switches: - nVidia/Mellanox: - improve events processing performance - Marvell: - add support for MV88E6250 family internal PHYs - Microchip: - add DCB and DSCP mapping support for KSZ switches - vsc73xx: convert to PHYLINK - Realtek: - rtl8226b/rtl8221b: add C45 instances and SerDes switching - Many driver changes related to PHYLIB and PHYLINK deprecated API cleanup - Ethernet PHYs: - Add a new driver for Airoha EN8811H 2.5 Gigabit PHY. - micrel: lan8814: add support for PPS out and external timestamp trigger - WiFi: - Disable Wireless Extensions (WEXT) in all Wi-Fi 7 devices drivers. Modern devices can only be configured using nl80211. - mac80211/cfg80211 - handle color change per link for WiFi 7 Multi-Link Operation - Intel (iwlwifi): - don't support puncturing in 5 GHz - support monitor mode on passive channels - BZ-W device support - P2P with HE/EHT support - re-add support for firmware API 90 - provide channel survey information for Automatic Channel Selection - MediaTek (mt76): - mt7921 LED control - mt7925 EHT radiotap support - mt7920e PCI support - Qualcomm (ath11k): - P2P support for QCA6390, WCN6855 and QCA2066 - support hibernation - ieee80211-freq-limit Device Tree property support - Qualcomm (ath12k): - refactoring in preparation of multi-link support - suspend and hibernation support - ACPI support - debugfs support, including dfs_simulate_radar support - RealTek: - rtw88: RTL8723CS SDIO device support - rtw89: RTL8922AE Wi-Fi 7 PCI device support - rtw89: complete features of new WiFi 7 chip 8922AE including BT-coexistence and Wake-on-WLAN - rtw89: use BIOS ACPI settings to set TX power and channels - rtl8xxxu: enable Management Frame Protection (MFP) support - Bluetooth: - support for Intel BlazarI and Filmore Peak2 (BE201) - support for MediaTek MT7921S SDIO - initial support for Intel PCIe BT driver - remove HCI_AMP support" * tag 'net-next-6.10' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next: (1827 commits) selftests: netfilter: fix packetdrill conntrack testcase net: gro: fix napi_gro_cb zeroed alignment Bluetooth: btintel_pcie: Refactor and code cleanup Bluetooth: btintel_pcie: Fix warning reported by sparse Bluetooth: hci_core: Fix not handling hdev->le_num_of_adv_sets=1 Bluetooth: btintel: Fix compiler warning for multi_v7_defconfig config Bluetooth: btintel_pcie: Fix compiler warnings Bluetooth: btintel_pcie: Add setup function to download firmware Bluetooth: btintel_pcie: Add support for PCIe transport Bluetooth: btintel: Export few static functions Bluetooth: HCI: Remove HCI_AMP support Bluetooth: L2CAP: Fix div-by-zero in l2cap_le_flowctl_init() Bluetooth: qca: Fix error code in qca_read_fw_build_info() Bluetooth: hci_conn: Use __counted_by() and avoid -Wfamnae warning Bluetooth: btintel: Add support for Filmore Peak2 (BE201) Bluetooth: btintel: Add support for BlazarI LE Create Connection command timeout increased to 20 secs dt-bindings: net: bluetooth: Add MediaTek MT7921S SDIO Bluetooth Bluetooth: compute LE flow credits based on recvbuf space Bluetooth: hci_sync: Use cmd->num_cis instead of magic number ...	2024-05-14 19:42:24 -07:00
Linus Torvalds	4f8b6f25eb	- Add a dm-crypt optional "high_priority" flag that enables the crypt workqueues to use WQ_HIGHPRI. - Export dm-crypt workqueues via sysfs (by enabling WQ_SYSFS) to allow for improved visibility and controls over IO and crypt workqueues. - Fix dm-crypt to no longer constrain max_segment_size to PAGE_SIZE. This limit isn't needed given that the block core provides late bio splitting if bio exceeds underlying limits (e.g. max_segment_size). - Fix dm-crypt crypt_queue's use of WQ_UNBOUND to not use WQ_CPU_INTENSIVE because it is meaningless with WQ_UNBOUND. - Fix various issues with dm-delay target (ranging from a resource teardown fix, a fix for hung task when using kthread mode, and other improvements that followed from code inspection). -----BEGIN PGP SIGNATURE----- iQEzBAABCAAdFiEEJfWUX4UqZ4x1O2wixSPxCi2dA1oFAmZDiggACgkQxSPxCi2d A1oQrwf7BUHy7ehwCjRrVlFTteIlx0ULTpPxictakN/S+xZfcZUcbE20OjNVzdk9 m5dx4Gn557rlMkiC4NDlHazVEVM5BbTpih27rvUgvX2nchUUdfHIT1OvU0isT4Yi h9g/o7i9DnBPjvyNjpXjP9YE7Xg8u2X9mxpv8DyU5M+QpFuofwzsfkCP7g14B0g2 btGxT3AZ5Bo8A/csKeSqHq13Nbq/dcAZZ3IvjIg1xSXjm6CoQ04rfO0TN6SKfsFJ GXteBS2JT1MMcXf3oKweAeQduTE+psVFea7gt/8haKFldnV+DpPNg1gU/7rza5Os eL1+L1iPY5fuEJIkaqPjBVRGkxQqHg== =+OhH -----END PGP SIGNATURE----- Merge tag 'for-6.10/dm-changes' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm Pull device mapper updates from Mike Snitzer: - Add a dm-crypt optional "high_priority" flag that enables the crypt workqueues to use WQ_HIGHPRI. - Export dm-crypt workqueues via sysfs (by enabling WQ_SYSFS) to allow for improved visibility and controls over IO and crypt workqueues. - Fix dm-crypt to no longer constrain max_segment_size to PAGE_SIZE. This limit isn't needed given that the block core provides late bio splitting if bio exceeds underlying limits (e.g. max_segment_size). - Fix dm-crypt crypt_queue's use of WQ_UNBOUND to not use WQ_CPU_INTENSIVE because it is meaningless with WQ_UNBOUND. - Fix various issues with dm-delay target (ranging from a resource teardown fix, a fix for hung task when using kthread mode, and other improvements that followed from code inspection). * tag 'for-6.10/dm-changes' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm: dm-delay: remove timer_lock dm-delay: change locking to avoid contention dm-delay: fix max_delay calculations dm-delay: fix hung task introduced by kthread mode dm-delay: fix workqueue delay_timer race dm-crypt: don't set WQ_CPU_INTENSIVE for WQ_UNBOUND crypt_queue dm: use queue_limits_set dm-crypt: stop constraining max_segment_size to PAGE_SIZE dm-crypt: export sysfs of all workqueues dm-crypt: add the optional "high_priority" flag	2024-05-14 18:34:19 -07:00
Linus Torvalds	0c9f4ac808	for-6.10/block-20240511 -----BEGIN PGP SIGNATURE----- iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmY/YgsQHGF4Ym9lQGtl cm5lbC5kawAKCRD301j7KXHgpvi0EACwnFRtYioizBH0x7QUHTBcIr0IhACd5gfz bm+uwlDUtf6G6lupHdJT9gOVB2z2z1m2Pz//8RuUVWw3Eqw2+rfgG8iJd+yo7IaV DpX3WaM4NnBvB7FKOKHlMPvGuf7KgbZ3uPm3x8cbrn/axMmkZ6ljxTixJ3p5t4+s xRsef/lVdG71DkXIFgTKATB86yNRJNlRQTbL+sZW22vdXdtfyBbOgR1sBuFfp7Hd g/uocZM/z0ahM6JH/5R2IX2ttKXMIBZLA8HRkJdvYqg022cj4js2YyRCPU3N6jQN MtN4TpJV5I++8l6SPQOOhaDNrK/6zFtDQpwG0YBiKKj3nQDgVbWWb8ejYTIUv4MP SrEto4MVBEqg5N65VwYYhIf45rmueFyJp6z0Vqv6Owur5nuww/YIFknmoMa/WDMd V8dIU3zL72FZDbPjIBjxHeqAGz9OgzEVafled7pi0Xbw6wqiB4kZihlMGXlD+WBy Yd6xo8PX4i5+d2LLKKPxpW1X0eJlKYJ/4dnYCoFN8LmXSiPJnMx2pYrV+NqMxy4X Thr8lxswLQC7j9YBBuIeDl8NB9N5FZZLvaC6I25QKq045M2ckJ+VrounsQb3vGwJ 72nlxxBZL8wz3sasgX9Pc1Cez9AqYbM+UZahq8ezPY5y3Jh0QfRw/MOk1ZaDNC8V CNOHBH0E+Q== =HnjE -----END PGP SIGNATURE----- Merge tag 'for-6.10/block-20240511' of git://git.kernel.dk/linux Pull block updates from Jens Axboe: - Add a partscan attribute in sysfs, fixing an issue with systemd relying on an internal interface that went away. - Attempt #2 at making long running discards interruptible. The previous attempt went into 6.9, but we ended up mostly reverting it as it had issues. - Remove old ida_simple API in bcache - Support for zoned write plugging, greatly improving the performance on zoned devices. - Remove the old throttle low interface, which has been experimental since 2017 and never made it beyond that and isn't being used. - Remove page->index debugging checks in brd, as it hasn't caught anything and prepares us for removing in struct page. - MD pull request from Song - Don't schedule block workers on isolated CPUs * tag 'for-6.10/block-20240511' of git://git.kernel.dk/linux: (84 commits) blk-throttle: delay initialization until configuration blk-throttle: remove CONFIG_BLK_DEV_THROTTLING_LOW block: fix that util can be greater than 100% block: support to account io_ticks precisely block: add plug while submitting IO bcache: fix variable length array abuse in btree_iter bcache: Remove usage of the deprecated ida_simple_xx() API md: Revert "md: Fix overflow in is_mddev_idle" blk-lib: check for kill signal in ioctl BLKDISCARD block: add a bio_await_chain helper block: add a blk_alloc_discard_bio helper block: add a bio_chain_and_submit helper block: move discard checks into the ioctl handler block: remove the discard_granularity check in __blkdev_issue_discard block/ioctl: prefer different overflow check null_blk: Fix the WARNING: modpost: missing MODULE_DESCRIPTION() block: fix and simplify blkdevparts= cmdline parsing block: refine the EOF check in blkdev_iomap_begin block: add a partscan sysfs attribute for disks block: add a disk_has_partscan helper ...	2024-05-13 13:03:54 -07:00
Masahiro Yamada	b1992c3772	kbuild: use $(src) instead of $(srctree)/$(src) for source directory Kbuild conventionally uses $(obj)/ for generated files, and $(src)/ for checked-in source files. It is merely a convention without any functional difference. In fact, $(obj) and $(src) are exactly the same, as defined in scripts/Makefile.build: src := $(obj) When the kernel is built in a separate output directory, $(src) does not accurately reflect the source directory location. While Kbuild resolves this discrepancy by specifying VPATH=$(srctree) to search for source files, it does not cover all cases. For example, when adding a header search path for local headers, -I$(srctree)/$(src) is typically passed to the compiler. This introduces inconsistency between upstream and downstream Makefiles because $(src) is used instead of $(srctree)/$(src) for the latter. To address this inconsistency, this commit changes the semantics of $(src) so that it always points to the directory in the source tree. Going forward, the variables used in Makefiles will have the following meanings: $(obj) - directory in the object tree $(src) - directory in the source tree (changed by this commit) $(objtree) - the top of the kernel object tree $(srctree) - the top of the kernel source tree Consequently, $(srctree)/$(src) in upstream Makefiles need to be replaced with $(src). Signed-off-by: Masahiro Yamada <masahiroy@kernel.org> Reviewed-by: Nicolas Schier <nicolas@fjasle.eu>	2024-05-10 04:34:52 +09:00
Benjamin Marzinski	8b21ac87d5	dm-delay: remove timer_lock Instead of manually checking the timer details in queue_timeout(), call timer_reduce() to start the timer or reduce the expiration time. This avoids needing a lock. Signed-off-by: Benjamin Marzinski <bmarzins@redhat.com> Signed-off-by: Mike Snitzer <snitzer@kernel.org>	2024-05-09 09:10:58 -04:00
Benjamin Marzinski	c542ee1492	dm-delay: change locking to avoid contention The delayed_bios list is protected by one mutex shared by all dm-delay devices. This mutex must be held whenever a bio is added or expired bios are removed from the list. Since a large number of expired bios could be on the list, flush_delayed_bios() can schedule while holding the mutex. This means a flush_delayed_bios() call on any dm-delay device can slow down delay_map() calls on any other dm-delay device. To keep dm-delay devices from slowing each other down and keep processing delay bios from slowing adding delayed bios, the global mutex has been removed, and each dm-delay device now has two locks. delayed_bios_lock is a spinlock that must be held whenever the delayed_bios list is accessed. process_bios_lock is a mutex that must be held whenever a process has temporarily pulled bios off the delayed_bios list to check which ones should be processed. It must be held until all the bios that won't be processed are returned to the list. This is what flush_delayed_bios() now does. The mutex is necessary to guarantee that delay_presuspend() sees the entire list of delayed bios when it calls flush_delayed_bios(). Signed-off-by: Benjamin Marzinski <bmarzins@redhat.com> Signed-off-by: Mike Snitzer <snitzer@kernel.org>	2024-05-09 09:10:58 -04:00
Benjamin Marzinski	64eb88d6ca	dm-delay: fix max_delay calculations delay_ctr() pointlessly compared max_delay in cases where multiple delay classes were initialized identically. Also, when write delays were configured different than read delays, delay_ctr() never compared their value against max_delay. Fix these issues. Fixes: `70bbeb29fa` ("dm delay: for short delays, use kthread instead of timers and wq") Signed-off-by: Benjamin Marzinski <bmarzins@redhat.com> Signed-off-by: Mike Snitzer <snitzer@kernel.org>	2024-05-09 09:10:58 -04:00
Joel Colledge	d14646f233	dm-delay: fix hung task introduced by kthread mode If the worker thread is not woken due to a bio, then it is not woken at all. This causes the hung task check to trigger. This occurs, for instance, when no bios are submitted. Also when a delay of 0 is configured, delay_bio() returns without waking the worker. Prevent the hung task check from triggering by creating the thread with kthread_run() instead of using kthread_create() directly. Fixes: `70bbeb29fa` ("dm delay: for short delays, use kthread instead of timers and wq") Signed-off-by: Joel Colledge <joel.colledge@linbit.com> Reviewed-by: Benjamin Marzinski <bmarzins@redhat.com> Signed-off-by: Mike Snitzer <snitzer@kernel.org>	2024-05-09 09:10:58 -04:00
Benjamin Marzinski	8d24790ed0	dm-delay: fix workqueue delay_timer race delay_timer could be pending when delay_dtr() is called. It needs to be shut down before kdelayd_wq is destroyed, so it won't try queueing more work to kdelayd_wq while that's getting destroyed. Also the del_timer_sync() call in delay_presuspend() doesn't protect against the timer getting immediately rearmed by the queued call to flush_delayed_bios(), but there's no real harm if that does happen. timer_delete() is less work, and is basically just as likely to stop a pointless call to flush_delayed_bios(). Fixes: `26b9f22870` ("dm: delay target") Signed-off-by: Benjamin Marzinski <bmarzins@redhat.com> Signed-off-by: Mike Snitzer <snitzer@kernel.org>	2024-05-09 09:10:58 -04:00
Matthew Mirvish	3a861560cc	bcache: fix variable length array abuse in btree_iter btree_iter is used in two ways: either allocated on the stack with a fixed size MAX_BSETS, or from a mempool with a dynamic size based on the specific cache set. Previously, the struct had a fixed-length array of size MAX_BSETS which was indexed out-of-bounds for the dynamically-sized iterators, which causes UBSAN to complain. This patch uses the same approach as in bcachefs's sort_iter and splits the iterator into a btree_iter with a flexible array member and a btree_iter_stack which embeds a btree_iter as well as a fixed-length data array. Cc: stable@vger.kernel.org Closes: https://bugs.launchpad.net/ubuntu/+source/linux/+bug/2039368 Signed-off-by: Matthew Mirvish <matthew@mm12.xyz> Signed-off-by: Coly Li <colyli@suse.de> Link: https://lore.kernel.org/r/20240509011117.2697-3-colyli@suse.de Signed-off-by: Jens Axboe <axboe@kernel.dk>	2024-05-08 19:15:01 -06:00
Christophe JAILLET	2abd9a197d	bcache: Remove usage of the deprecated ida_simple_xx() API ida_alloc() and ida_free() should be preferred to the deprecated ida_simple_get() and ida_simple_remove(). Note that the upper limit of ida_simple_get() is exclusive, but the one of ida_alloc_max() is inclusive. So a -1 has been added when needed. Signed-off-by: Christophe JAILLET <christophe.jaillet@wanadoo.fr> Signed-off-by: Coly Li <colyli@suse.de> Link: https://lore.kernel.org/r/20240509011117.2697-2-colyli@suse.de Signed-off-by: Jens Axboe <axboe@kernel.dk>	2024-05-08 19:15:01 -06:00
Li Nan	504fbcffea	md: Revert "md: Fix overflow in is_mddev_idle" This reverts commit `3f9f231236`. Using 64bit for 'sync_io' is unnecessary from the gendisk side. This overflow will not cause any functional impact, except for a UBSAN warning. Solving this overflow requires introducing additional calculations and checks which are not necessary. So just keep using 32bit for 'sync_io'. Signed-off-by: Li Nan <linan122@huawei.com> Link: https://lore.kernel.org/r/20240507023103.781816-1-linan666@huaweicloud.com Signed-off-by: Jens Axboe <axboe@kernel.dk>	2024-05-07 07:34:08 -06:00
Al Viro	224941e837	use ->bd_mapping instead of ->bd_inode->i_mapping Just the low-hanging fruit... Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> Link: https://lore.kernel.org/r/20240411145346.2516848-2-viro@zeniv.linux.org.uk Signed-off-by: Christian Brauner <brauner@kernel.org>	2024-05-03 02:36:51 -04:00
Al Viro	dc0fdfc855	dm-vdo: use bdev_nr_bytes(bdev) instead of i_size_read(bdev->bd_inode) going to be faster, actually - shift is cheaper than dereference... Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> Link: https://lore.kernel.org/r/20240411145346.2516848-9-viro@zeniv.linux.org.uk Reviewed-by: Matthew Sakai <msakai@redhat.com> Signed-off-by: Christian Brauner <brauner@kernel.org>	2024-05-03 02:36:21 -04:00

... 4 5 6 7 8 ...

8638 Commits