Commit Graph

8638 Commits

Author SHA1 Message Date
Linus Torvalds 6d8854216e block-6.16-20250606
-----BEGIN PGP SIGNATURE-----
 
 iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmhC7/UQHGF4Ym9lQGtl
 cm5lbC5kawAKCRD301j7KXHgps+6D/9BOhkMyMkUF9LAev4PBNE+x3aftjl7Y1AY
 EHv2vozb4nDwXIaalG4qGUhprz2+z+hqxYjmnlOAsqbixhcSzKK5z9rjxyDka776
 x03vfvKXaXZUG7XN7ENY8sJnLx4QJ0nh4+0gzT9yDyq2vKvPFLEKweNOxKDKCSbE
 31vGoLFwjltp74hX+Qrnj1KMaTLgvAaV0eXKWlbX7Iiw6GFVm200zb27gth6U8bV
 WQAmjSkFQ0daHtAWmXIVy7hrXiCqe8D6YPKvBXnQ4cfKVbgG0HHDuTmQLpKGzfMi
 rr24MU5vZjt6OsYalceiTtifSUcf/I2+iFV7HswOk9kpOY5A2ylsWawRP2mm4PDI
 nJE3LaSTRpEvs5kzPJ2kr8Zp4/uvF6ehSq8Y9w52JekmOzxusLcRcswezaO00EI0
 32uuK+P505EGTcCBTrEdtaI6k7zzQEeVoIpxqvMhNRG/s5vzvIV3eVrALu2HSDma
 P3paEdx7PwJla3ndmdChfh1vUR3TW3gWoZvoNCVmJzNCnLEAScTS2NsiQeEjy8zs
 20IGsrRgIqt9KR8GZ2zj1ZOM47Cg0dIU3pbbA2Ja71wx4TYXJCSFFRK7mzDtXYlY
 BWOix/Dks8tk118cwuxnT+IiwmWDMbDZKnygh+4tiSyrs0IszeekRADLUu03C0Ve
 Dhpljqf3zA==
 =gs32
 -----END PGP SIGNATURE-----

Merge tag 'block-6.16-20250606' of git://git.kernel.dk/linux

Pull more block updates from Jens Axboe:

 - NVMe pull request via Christoph:
      - TCP error handling fix (Shin'ichiro Kawasaki)
      - TCP I/O stall handling fixes (Hannes Reinecke)
      - fix command limits status code (Keith Busch)
      - support vectored buffers also for passthrough (Pavel Begunkov)
      - spelling fixes (Yi Zhang)

 - MD pull request via Yu:
      - fix REQ_RAHEAD and REQ_NOWAIT IO err handling for raid1/10
      - fix max_write_behind setting for dm-raid
      - some minor cleanups

 - Integrity data direction fix and cleanup

 - bcache NULL pointer fix

 - Fix for loop missing write start/end handling

 - Decouple hardware queues and IO threads in ublk

 - Slew of ublk selftests additions and updates

* tag 'block-6.16-20250606' of git://git.kernel.dk/linux: (29 commits)
  nvme: spelling fixes
  nvme-tcp: fix I/O stalls on congested sockets
  nvme-tcp: sanitize request list handling
  nvme-tcp: remove tag set when second admin queue config fails
  nvme: enable vectored registered bufs for passthrough cmds
  nvme: fix implicit bool to flags conversion
  nvme: fix command limits status code
  selftests: ublk: kublk: improve behavior on init failure
  block: flip iter directions in blk_rq_integrity_map_user()
  block: drop direction param from bio_integrity_copy_user()
  selftests: ublk: cover PER_IO_DAEMON in more stress tests
  Documentation: ublk: document UBLK_F_PER_IO_DAEMON
  selftests: ublk: add stress test for per io daemons
  selftests: ublk: add functional test for per io daemons
  selftests: ublk: kublk: decouple ublk_queues from ublk server threads
  selftests: ublk: kublk: move per-thread data out of ublk_queue
  selftests: ublk: kublk: lift queue initialization out of thread
  selftests: ublk: kublk: tie sqe allocation to io instead of queue
  selftests: ublk: kublk: plumb q_id in io_uring user_data
  ublk: have a per-io daemon instead of a per-queue daemon
  ...
2025-06-06 13:12:50 -07:00
Linus Torvalds 3c727285f1 - dm: better error handling when reloading a table
- dm-delay: don't busy-wait in kthread
 
 - dm: use use generic disable_* functions instead of open coding them
 
 - dm: lock queue limits when reading them
 
 - dm-verity: use softirq context only when !need_resched()
 
 - dm-bufio: remove maximum age based eviction
 
 - dm: remove unneeded kvfree from alloc_targets
 
 - dm-flakey: various fixes
 
 - dm-mpath: interface for explicit probing of active paths
 
 - dm: fix BLK_FEAT_ATOMIC_WRITES
 
 - dm: pass through operations on wrapped inline crypto keys
 
 - dm vdo indexer: don't read request structure after enqueuing
 
 - dm-zone: Use bdev_*() helper functions where applicable
 
 - dm-mpath: replace spin_lock_irqsave with spin_lock_irq
 
 - dm-mirror: fix a tiny race condition
 
 - dm-verity: fix a memory leak if some arguments are specified multiple times
 
 - dm-stripe: small code cleanup
 -----BEGIN PGP SIGNATURE-----
 
 iIoEABYIADIWIQRnH8MwLyZDhyYfesYTAyx9YGnhbQUCaD8vFBQcbXBhdG9ja2FA
 cmVkaGF0LmNvbQAKCRATAyx9YGnhbYGFAQC/V62PzDUa326WSdvwhtYe6jphInlW
 ZSmh37L4MIcV2wD/S8UqIaC9GSakee6jEWBTRiDqNZNEOIWhbd7f6gnTOQ0=
 =eYUe
 -----END PGP SIGNATURE-----

Merge tag 'for-6.16/dm-changes' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm

Pull device mapper updates from Mikulas Patocka:

 - better error handling when reloading a table

 - use use generic disable_* functions instead of open coding them

 - lock queue limits when reading them

 - remove unneeded kvfree from alloc_targets

 - fix BLK_FEAT_ATOMIC_WRITES

 - pass through operations on wrapped inline crypto keys

 - dm-verity:
     - use softirq context only when !need_resched()
     - fix a memory leak if some arguments are specified multiple times

 - dm-mpath:
    - interface for explicit probing of active paths
    - replace spin_lock_irqsave with spin_lock_irq

 - dm-delay: don't busy-wait in kthread

 - dm-bufio: remove maximum age based eviction

 - dm-flakey: various fixes

 - vdo indexer: don't read request structure after enqueuing

 - dm-zone: Use bdev_*() helper functions where applicable

 - dm-mirror: fix a tiny race condition

 - dm-stripe: small code cleanup

* tag 'for-6.16/dm-changes' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm: (29 commits)
  dm-stripe: small code cleanup
  dm-verity: fix a memory leak if some arguments are specified multiple times
  dm-mirror: fix a tiny race condition
  dm-table: check BLK_FEAT_ATOMIC_WRITES inside limits_lock
  dm mpath: replace spin_lock_irqsave with spin_lock_irq
  dm-mpath: Don't grab work_mutex while probing paths
  dm-zone: Use bdev_*() helper functions where applicable
  dm vdo indexer: don't read request structure after enqueuing
  dm: pass through operations on wrapped inline crypto keys
  blk-crypto: export wrapped key functions
  dm-table: Set BLK_FEAT_ATOMIC_WRITES for target queue limits
  dm mpath: Interface for explicit probing of active paths
  dm: Allow .prepare_ioctl to handle ioctls directly
  dm-flakey: make corrupting read bios work
  dm-flakey: remove useless ERROR_READS check in flakey_end_io
  dm-flakey: error all IOs when num_features is absent
  dm-flakey: Clean up parsing messages
  dm: remove unneeded kvfree from alloc_targets
  dm-bufio: remove maximum age based eviction
  dm-verity: use softirq context only when !need_resched()
  ...
2025-06-03 15:54:46 -07:00
Mikulas Patocka 9f2f6316d7 dm-stripe: small code cleanup
This commit doesn't fix any bug, it is just code cleanup. Use the
function format_dev_t instead of sprintf, because format_dev_t does the
same thing.

Remove the useless memset call.

An unsigned integer can take at most 10 digits, so extend the array size
to 22. (note that because the range of minor and major numbers is limited,
the size 16 could not be exceeded, thus this function couldn't write
beyond string end)

Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
2025-06-03 19:06:32 +02:00
Mikulas Patocka 66be40a14e dm-verity: fix a memory leak if some arguments are specified multiple times
If some of the arguments "check_at_most_once", "ignore_zero_blocks",
"use_fec_from_device", "root_hash_sig_key_desc" were specified more than
once on the target line, a memory leak would happen.

This commit fixes the memory leak. It also fixes error handling in
verity_verify_sig_parse_opt_args.

Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Cc: stable@vger.kernel.org
2025-06-03 19:01:42 +02:00
Mikulas Patocka 829451beae dm-mirror: fix a tiny race condition
There's a tiny race condition in dm-mirror. The functions queue_bio and
write_callback grab a spinlock, add a bio to the list, drop the spinlock
and wake up the mirrord thread that processes bios in the list.

It may be possible that the mirrord thread processes the bio just after
spin_unlock_irqrestore is called, before wakeup_mirrord. This spurious
wake-up is normally harmless, however if the device mapper device is
unloaded just after the bio was processed, it may be possible that
wakeup_mirrord(ms) uses invalid "ms" pointer.

Fix this bug by moving wakeup_mirrord inside the spinlock.

Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Cc: stable@vger.kernel.org
2025-06-03 19:01:23 +02:00
Benjamin Marzinski 85f6d5b729 dm-table: check BLK_FEAT_ATOMIC_WRITES inside limits_lock
dm_set_device_limits() should check q->limits.features for
BLK_FEAT_ATOMIC_WRITES while holding q->limits_lock, like it does for
the rest of the queue limits.

Fixes: b7c18b17a1 ("dm-table: Set BLK_FEAT_ATOMIC_WRITES for target queue limits")
Signed-off-by: Benjamin Marzinski <bmarzins@redhat.com>
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
2025-06-02 14:24:51 +02:00
Linus Torvalds 7d4e49a77d - The 3 patch series "hung_task: extend blocking task stacktrace dump to
semaphore" from Lance Yang enhances the hung task detector.  The
   detector presently dumps the blocking tasks's stack when it is blocked
   on a mutex.  Lance's series extends this to semaphores.
 
 - The 2 patch series "nilfs2: improve sanity checks in dirty state
   propagation" from Wentao Liang addresses a couple of minor flaws in
   nilfs2.
 
 - The 2 patch series "scripts/gdb: Fixes related to lx_per_cpu()" from
   Illia Ostapyshyn fixes a couple of issues in the gdb scripts.
 
 - The 9 patch series "Support kdump with LUKS encryption by reusing LUKS
   volume keys" from Coiby Xu addresses a usability problem with kdump.
   When the dump device is LUKS-encrypted, the kdump kernel may not have
   the keys to the encrypted filesystem.  A full writeup of this is in the
   series [0/N] cover letter.
 
 - The 2 patch series "sysfs: add counters for lockups and stalls" from
   Max Kellermann adds /sys/kernel/hardlockup_count and
   /sys/kernel/hardlockup_count and /sys/kernel/rcu_stall_count.
 
 - The 3 patch series "fork: Page operation cleanups in the fork code"
   from Pasha Tatashin implements a number of code cleanups in fork.c.
 
 - The 3 patch series "scripts/gdb/symbols: determine KASLR offset on
   s390 during early boot" from Ilya Leoshkevich fixes some s390 issues in
   the gdb scripts.
 -----BEGIN PGP SIGNATURE-----
 
 iHUEABYKAB0WIQTTMBEPP41GrTpTJgfdBJ7gKXxAjgUCaDuCvQAKCRDdBJ7gKXxA
 jrkxAQCnFAp/uK9ckkbN4nfpJ0+OMY36C+A+dawSDtuRsIkXBAEAq3e6MNAUdg5W
 Ca0cXdgSIq1Op7ZKEA+66Km6Rfvfow8=
 =g45L
 -----END PGP SIGNATURE-----

Merge tag 'mm-nonmm-stable-2025-05-31-15-28' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm

Pull non-MM updates from Andrew Morton:

 - "hung_task: extend blocking task stacktrace dump to semaphore" from
   Lance Yang enhances the hung task detector.

   The detector presently dumps the blocking tasks's stack when it is
   blocked on a mutex. Lance's series extends this to semaphores

 - "nilfs2: improve sanity checks in dirty state propagation" from
   Wentao Liang addresses a couple of minor flaws in nilfs2

 - "scripts/gdb: Fixes related to lx_per_cpu()" from Illia Ostapyshyn
   fixes a couple of issues in the gdb scripts

 - "Support kdump with LUKS encryption by reusing LUKS volume keys" from
   Coiby Xu addresses a usability problem with kdump.

   When the dump device is LUKS-encrypted, the kdump kernel may not have
   the keys to the encrypted filesystem. A full writeup of this is in
   the series [0/N] cover letter

 - "sysfs: add counters for lockups and stalls" from Max Kellermann adds
   /sys/kernel/hardlockup_count and /sys/kernel/hardlockup_count and
   /sys/kernel/rcu_stall_count

 - "fork: Page operation cleanups in the fork code" from Pasha Tatashin
   implements a number of code cleanups in fork.c

 - "scripts/gdb/symbols: determine KASLR offset on s390 during early
   boot" from Ilya Leoshkevich fixes some s390 issues in the gdb
   scripts

* tag 'mm-nonmm-stable-2025-05-31-15-28' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (67 commits)
  llist: make llist_add_batch() a static inline
  delayacct: remove redundant code and adjust indentation
  squashfs: add optional full compressed block caching
  crash_dump, nvme: select CONFIGFS_FS as built-in
  scripts/gdb/symbols: determine KASLR offset on s390 during early boot
  scripts/gdb/symbols: factor out pagination_off()
  scripts/gdb/symbols: factor out get_vmlinux()
  kernel/panic.c: format kernel-doc comments
  mailmap: update and consolidate Casey Connolly's name and email
  nilfs2: remove wbc->for_reclaim handling
  fork: define a local GFP_VMAP_STACK
  fork: check charging success before zeroing stack
  fork: clean-up naming of vm_stack/vm_struct variables in vmap stacks code
  fork: clean-up ifdef logic around stack allocation
  kernel/rcu/tree_stall: add /sys/kernel/rcu_stall_count
  kernel/watchdog: add /sys/kernel/{hard,soft}lockup_count
  x86/crash: make the page that stores the dm crypt keys inaccessible
  x86/crash: pass dm crypt keys to kdump kernel
  Revert "x86/mm: Remove unused __set_memory_prot()"
  crash_dump: retrieve dm crypt keys in kdump kernel
  ...
2025-05-31 19:12:53 -07:00
Yu Kuai 01bf468c4e md/md-bitmap: remove parameter slot from bitmap_create()
All callers pass in '-1' for 'slot', hence it can be removed.

Link: https://lore.kernel.org/linux-raid/20250524061320.370630-6-yukuai1@huaweicloud.com
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Xiao Ni <xni@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Hannes Reinecke <hare@suse.de>
2025-05-30 15:47:23 +08:00
Yu Kuai 38f520a37d md/md-bitmap: cleanup bitmap_ops->startwrite()
bitmap_startwrite() always return 0, and the caller doesn't check return
value as well, hence change the method to void.

Also rename startwrite/endwrite to start_write/end_write, which is more in
line with the usual naming convention.

Link: https://lore.kernel.org/linux-raid/20250524061320.370630-4-yukuai1@huaweicloud.com
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Reviewed-by: Xiao Ni <xni@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Hannes Reinecke <hare@suse.de>
2025-05-30 15:47:23 +08:00
Yu Kuai b886475804 md/dm-raid: remove max_write_behind setting limit
The comments said 'vaule in kB', while the value actually means the
number of write_behind IOs. And since md-bitmap will automatically
adjust the value to max COUNTER_MAX / 2, there is no need to fail
early.

Also move some macros that is only used md-bitmap.c.

Link: https://lore.kernel.org/linux-raid/20250524061320.370630-15-yukuai1@huaweicloud.com
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Xiao Ni <xni@redhat.com>
2025-05-30 15:47:23 +08:00
Yu Kuai 2afe17794c md/md-bitmap: fix dm-raid max_write_behind setting
It's supposed to be COUNTER_MAX / 2, not COUNTER_MAX.

Link: https://lore.kernel.org/linux-raid/20250524061320.370630-14-yukuai1@huaweicloud.com
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Hannes Reinecke <hare@suse.de>
2025-05-30 15:47:23 +08:00
Yu Kuai 9f346f7d4e md/raid1,raid10: don't handle IO error for REQ_RAHEAD and REQ_NOWAIT
IO with REQ_RAHEAD or REQ_NOWAIT can fail early, even if the storage medium
is fine, hence record badblocks or remove the disk from array does not
make sense.

This problem if found by lvm2 test lvcreate-large-raid, where dm-zero
will fail read ahead IO directly.

Fixes: e879a0d9cb ("md/raid1,raid10: don't ignore IO flags")
Reported-and-tested-by: Mikulas Patocka <mpatocka@redhat.com>
Closes: https://lore.kernel.org/all/34fa755d-62c8-4588-8ee1-33cb1249bdf2@redhat.com/
Link: https://lore.kernel.org/linux-raid/20250527081407.3004055-1-yukuai1@huaweicloud.com
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
2025-05-30 15:46:45 +08:00
Linus Torvalds 48cfc5791d hardening updates for v6.16-rc1
- Update overflow helpers to ease refactoring of on-stack flex array
   instances (Gustavo A. R. Silva, Kees Cook)
 
 - lkdtm: Use SLAB_NO_MERGE instead of constructors (Harry Yoo)
 
 - Simplify CONFIG_CC_HAS_COUNTED_BY (Jan Hendrik Farr)
 
 - Disable u64 usercopy KUnit test on 32-bit SPARC (Thomas Weißschuh)
 
 - Add missed designated initializers now exposed by fixed randstruct
   (Nathan Chancellor, Kees Cook)
 
 - Document compilers versions for __builtin_dynamic_object_size
 
 - Remove ARM_SSP_PER_TASK GCC plugin
 
 - Fix GCC plugin randstruct, add selftests, and restore COMPILE_TEST
   builds
 
 - Kbuild: induce full rebuilds when dependencies change with GCC plugins,
   the Clang sanitizer .scl file, or the randstruct seed.
 
 - Kbuild: Switch from -Wvla to -Wvla-larger-than=1
 
 - Correct several __nonstring uses for -Wunterminated-string-initialization
 -----BEGIN PGP SIGNATURE-----
 
 iHUEABYKAB0WIQRSPkdeREjth1dHnSE2KwveOeQkuwUCaDUq9gAKCRA2KwveOeQk
 u+ZCAQDhqpOE/yn5gfjyplIvaTtzj9CaW6g11AmPYrimJCuj3QD9G+0o35kzlXOw
 f0ZIj2U7LFNgbLos+20hQwhMFf1Zhgg=
 =OYzD
 -----END PGP SIGNATURE-----

Merge tag 'hardening-v6.16-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux

Pull hardening updates from Kees Cook:

 - Update overflow helpers to ease refactoring of on-stack flex array
   instances (Gustavo A. R. Silva, Kees Cook)

 - lkdtm: Use SLAB_NO_MERGE instead of constructors (Harry Yoo)

 - Simplify CONFIG_CC_HAS_COUNTED_BY (Jan Hendrik Farr)

 - Disable u64 usercopy KUnit test on 32-bit SPARC (Thomas Weißschuh)

 - Add missed designated initializers now exposed by fixed randstruct
   (Nathan Chancellor, Kees Cook)

 - Document compilers versions for __builtin_dynamic_object_size

 - Remove ARM_SSP_PER_TASK GCC plugin

 - Fix GCC plugin randstruct, add selftests, and restore COMPILE_TEST
   builds

 - Kbuild: induce full rebuilds when dependencies change with GCC
   plugins, the Clang sanitizer .scl file, or the randstruct seed.

 - Kbuild: Switch from -Wvla to -Wvla-larger-than=1

 - Correct several __nonstring uses for -Wunterminated-string-initialization

* tag 'hardening-v6.16-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux: (23 commits)
  Revert "hardening: Disable GCC randstruct for COMPILE_TEST"
  lib/tests: randstruct: Add deep function pointer layout test
  lib/tests: Add randstruct KUnit test
  randstruct: gcc-plugin: Remove bogus void member
  net: qede: Initialize qede_ll_ops with designated initializer
  scsi: qedf: Use designated initializer for struct qed_fcoe_cb_ops
  md/bcache: Mark __nonstring look-up table
  integer-wrap: Force full rebuild when .scl file changes
  randstruct: Force full rebuild when seed changes
  gcc-plugins: Force full rebuild when plugins change
  kbuild: Switch from -Wvla to -Wvla-larger-than=1
  hardening: simplify CONFIG_CC_HAS_COUNTED_BY
  overflow: Fix direct struct member initialization in _DEFINE_FLEX()
  kunit/overflow: Add tests for STACK_FLEX_ARRAY_SIZE() helper
  overflow: Add STACK_FLEX_ARRAY_SIZE() helper
  input/joystick: magellan: Mark __nonstring look-up table const
  watchdog: exar: Shorten identity name to fit correctly
  mod_devicetable: Enlarge the maximum platform_device_id name length
  overflow: Clarify expectations for getting DEFINE_FLEX variable sizes
  compiler_types: Identify compiler versions for __builtin_dynamic_object_size
  ...
2025-05-28 07:47:10 -07:00
Mingzhe Zou 208c1559c5 bcache: reserve more RESERVE_BTREE buckets to prevent allocator hang
Reported an IO hang and unrecoverable error in our testing environment.

After careful research, we found that bch_allocator_thread is stuck,
the call stack is as follows:
[<0>] __switch_to+0xbc/0x108
[<0>] __closure_sync+0x7c/0xbc [bcache]
[<0>] bch_prio_write+0x430/0x448 [bcache]
[<0>] bch_allocator_thread+0xb44/0xb70 [bcache]
[<0>] kthread+0x124/0x130
[<0>] ret_from_fork+0x10/0x18

Moreover, the RESERVE_BTREE type bucket slot are empty and journal_full
occurs at the same time.

When the cache disk is first used, the sb.nJournal_buckets defaults to 0.
So, only 8 RESERVE_BTREE type buckets are reserved. If RESERVE_BTREE type
buckets used up or btree_check_reserve() failed when request handle btree
split, the request will be repeatedly retried and wait for alloc thread to
fill in.

After the alloc thread fills the buckets, it will call bch_prio_write().
If journal_full occurs simultaneously at this time, journal_reclaim() and
btree_flush_write() will be called sequentially, journal_write cannot be
completed.

This is a low probability event, we believe that reserve more RESERVE_BTREE
buckets can avoid the worst situation.

Fixes: 682811b3ce ("bcache: fix for allocator and register thread race")
Signed-off-by: Mingzhe Zou <mingzhe.zou@easystack.cn>
Signed-off-by: Coly Li <colyli@kernel.org>
Link: https://lore.kernel.org/r/20250527051601.74407-4-colyli@kernel.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-05-27 07:38:19 -06:00
Robert Pang 5a08e49f23 bcache: remove unused constants
Remove constants MAX_NEED_GC and MAX_SAVE_PRIO in btree.c that have been unused
since initial commit.

Signed-off-by: Robert Pang <robertpang@google.com>
Signed-off-by: Coly Li <colyli@kernel.org>
Link: https://lore.kernel.org/r/20250527051601.74407-3-colyli@kernel.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-05-27 07:38:19 -06:00
Linggang Zeng 1e46ed947e bcache: fix NULL pointer in cache_set_flush()
1. LINE#1794 - LINE#1887 is some codes about function of
   bch_cache_set_alloc().
2. LINE#2078 - LINE#2142 is some codes about function of
   register_cache_set().
3. register_cache_set() will call bch_cache_set_alloc() in LINE#2098.

 1794 struct cache_set *bch_cache_set_alloc(struct cache_sb *sb)
 1795 {
 ...
 1860         if (!(c->devices = kcalloc(c->nr_uuids, sizeof(void *), GFP_KERNEL)) ||
 1861             mempool_init_slab_pool(&c->search, 32, bch_search_cache) ||
 1862             mempool_init_kmalloc_pool(&c->bio_meta, 2,
 1863                                 sizeof(struct bbio) + sizeof(struct bio_vec) *
 1864                                 bucket_pages(c)) ||
 1865             mempool_init_kmalloc_pool(&c->fill_iter, 1, iter_size) ||
 1866             bioset_init(&c->bio_split, 4, offsetof(struct bbio, bio),
 1867                         BIOSET_NEED_BVECS|BIOSET_NEED_RESCUER) ||
 1868             !(c->uuids = alloc_bucket_pages(GFP_KERNEL, c)) ||
 1869             !(c->moving_gc_wq = alloc_workqueue("bcache_gc",
 1870                                                 WQ_MEM_RECLAIM, 0)) ||
 1871             bch_journal_alloc(c) ||
 1872             bch_btree_cache_alloc(c) ||
 1873             bch_open_buckets_alloc(c) ||
 1874             bch_bset_sort_state_init(&c->sort, ilog2(c->btree_pages)))
 1875                 goto err;
                      ^^^^^^^^
 1876
 ...
 1883         return c;
 1884 err:
 1885         bch_cache_set_unregister(c);
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^
 1886         return NULL;
 1887 }
 ...
 2078 static const char *register_cache_set(struct cache *ca)
 2079 {
 ...
 2098         c = bch_cache_set_alloc(&ca->sb);
 2099         if (!c)
 2100                 return err;
                      ^^^^^^^^^^
 ...
 2128         ca->set = c;
 2129         ca->set->cache[ca->sb.nr_this_dev] = ca;
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
 ...
 2138         return NULL;
 2139 err:
 2140         bch_cache_set_unregister(c);
 2141         return err;
 2142 }

(1) If LINE#1860 - LINE#1874 is true, then do 'goto err'(LINE#1875) and
    call bch_cache_set_unregister()(LINE#1885).
(2) As (1) return NULL(LINE#1886), LINE#2098 - LINE#2100 would return.
(3) As (2) has returned, LINE#2128 - LINE#2129 would do *not* give the
    value to c->cache[], it means that c->cache[] is NULL.

LINE#1624 - LINE#1665 is some codes about function of cache_set_flush().
As (1), in LINE#1885 call
bch_cache_set_unregister()
---> bch_cache_set_stop()
     ---> closure_queue()
          -.-> cache_set_flush() (as below LINE#1624)

 1624 static void cache_set_flush(struct closure *cl)
 1625 {
 ...
 1654         for_each_cache(ca, c, i)
 1655                 if (ca->alloc_thread)
                          ^^
 1656                         kthread_stop(ca->alloc_thread);
 ...
 1665 }

(4) In LINE#1655 ca is NULL(see (3)) in cache_set_flush() then the
    kernel crash occurred as below:
[  846.712887] bcache: register_cache() error drbd6: cannot allocate memory
[  846.713242] bcache: register_bcache() error : failed to register device
[  846.713336] bcache: cache_set_free() Cache set 2f84bdc1-498a-4f2f-98a7-01946bf54287 unregistered
[  846.713768] BUG: unable to handle kernel NULL pointer dereference at 00000000000009f8
[  846.714790] PGD 0 P4D 0
[  846.715129] Oops: 0000 [#1] SMP PTI
[  846.715472] CPU: 19 PID: 5057 Comm: kworker/19:16 Kdump: loaded Tainted: G           OE    --------- -  - 4.18.0-147.5.1.el8_1.5es.3.x86_64 #1
[  846.716082] Hardware name: ESPAN GI-25212/X11DPL-i, BIOS 2.1 06/15/2018
[  846.716451] Workqueue: events cache_set_flush [bcache]
[  846.716808] RIP: 0010:cache_set_flush+0xc9/0x1b0 [bcache]
[  846.717155] Code: 00 4c 89 a5 b0 03 00 00 48 8b 85 68 f6 ff ff a8 08 0f 84 88 00 00 00 31 db 66 83 bd 3c f7 ff ff 00 48 8b 85 48 ff ff ff 74 28 <48> 8b b8 f8 09 00 00 48 85 ff 74 05 e8 b6 58 a2 e1 0f b7 95 3c f7
[  846.718026] RSP: 0018:ffffb56dcf85fe70 EFLAGS: 00010202
[  846.718372] RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000000000
[  846.718725] RDX: 0000000000000001 RSI: 0000000040000001 RDI: 0000000000000000
[  846.719076] RBP: ffffa0ccc0f20df8 R08: ffffa0ce1fedb118 R09: 000073746e657665
[  846.719428] R10: 8080808080808080 R11: 0000000000000000 R12: ffffa0ce1fee8700
[  846.719779] R13: ffffa0ccc0f211a8 R14: ffffa0cd1b902840 R15: ffffa0ccc0f20e00
[  846.720132] FS:  0000000000000000(0000) GS:ffffa0ce1fec0000(0000) knlGS:0000000000000000
[  846.720726] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  846.721073] CR2: 00000000000009f8 CR3: 00000008ba00a005 CR4: 00000000007606e0
[  846.721426] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[  846.721778] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[  846.722131] PKRU: 55555554
[  846.722467] Call Trace:
[  846.722814]  process_one_work+0x1a7/0x3b0
[  846.723157]  worker_thread+0x30/0x390
[  846.723501]  ? create_worker+0x1a0/0x1a0
[  846.723844]  kthread+0x112/0x130
[  846.724184]  ? kthread_flush_work_fn+0x10/0x10
[  846.724535]  ret_from_fork+0x35/0x40

Now, check whether that ca is NULL in LINE#1655 to fix the issue.

Signed-off-by: Linggang Zeng <linggang.zeng@easystack.cn>
Signed-off-by: Mingzhe Zou <mingzhe.zou@easystack.cn>
Signed-off-by: Coly Li <colyli@kernel.org>
Link: https://lore.kernel.org/r/20250527051601.74407-2-colyli@kernel.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-05-27 07:38:19 -06:00
Linus Torvalds 6f59de9bc0 for-6.16/block-20250523
-----BEGIN PGP SIGNATURE-----
 
 iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmgwnGYQHGF4Ym9lQGtl
 cm5lbC5kawAKCRD301j7KXHgpq9aD/4iqOts77xhWWLrOJWkkhOcV5rREeyppq8X
 MKYul9S4cc4Uin9Xou9a+nab31QBQEk3nsN3kX9o3yAXvkh6yUm36HD8qYNW/46q
 IUkwRQQJ0COyTnexMZQNTbZPQDIYcenXmQxOcrEJ5jC1Jcz0sOKHsgekL+ab3kCy
 fLnuz2ozvjGDMala/NmE8fN5qSlj4qQABHgbamwlwfo4aWu07cwfqn5G/FCYJgDO
 xUvsnTVclom2g4G+7eSSvGQI1QyAxl5QpviPnj/TEgfFBFnhbCSoBTEY6ecqhlfW
 6u59MF/Uw8E+weiuGY4L87kDtBhjQs3UMSLxCuwH7MxXb25ff7qB4AIkcFD0kKFH
 3V5NtwqlU7aQT0xOjGxaHhfPwjLD+FVss4ARmuHS09/Kn8egOW9yROPyetnuH84R
 Oz0Ctnt1IPLFjvGeg3+rt9fjjS9jWOXLITb9Q6nX9gnCt7orCwIYke8YCpmnJyhn
 i+fV4CWYIQBBRKxIT0E/GhJxZOmL0JKpomnbpP2dH8npemnsTCuvtfdrK9gfhH2X
 chBVqCPY8MNU5zKfzdEiavPqcm9392lMzOoOXW2pSC1eAKqnAQ86ZT3r7rLntqE8
 75LxHcvaQIsnpyG+YuJVHvoiJ83TbqZNpyHwNaQTYhDmdYpp2d/wTtTQywX4DuXb
 Y6NDJw5+kQ==
 =1PNK
 -----END PGP SIGNATURE-----

Merge tag 'for-6.16/block-20250523' of git://git.kernel.dk/linux

Pull block updates from Jens Axboe:

 - ublk updates:
      - Add support for updating the size of a ublk instance
      - Zero-copy improvements
      - Auto-registering of buffers for zero-copy
      - Series simplifying and improving GET_DATA and request lookup
      - Series adding quiesce support
      - Lots of selftests additions
      - Various cleanups

 - NVMe updates via Christoph:
      - add per-node DMA pools and use them for PRP/SGL allocations
        (Caleb Sander Mateos, Keith Busch)
      - nvme-fcloop refcounting fixes (Daniel Wagner)
      - support delayed removal of the multipath node and optionally
        support the multipath node for private namespaces (Nilay Shroff)
      - support shared CQs in the PCI endpoint target code (Wilfred
        Mallawa)
      - support admin-queue only authentication (Hannes Reinecke)
      - use the crc32c library instead of the crypto API (Eric Biggers)
      - misc cleanups (Christoph Hellwig, Marcelo Moreira, Hannes
        Reinecke, Leon Romanovsky, Gustavo A. R. Silva)

 - MD updates via Yu:
      - Fix that normal IO can be starved by sync IO, found by mkfs on
        newly created large raid5, with some clean up patches for bdev
        inflight counters

 - Clean up brd, getting rid of atomic kmaps and bvec poking

 - Add loop driver specifically for zoned IO testing

 - Eliminate blk-rq-qos calls with a static key, if not enabled

 - Improve hctx locking for when a plug has IO for multiple queues
   pending

 - Remove block layer bouncing support, which in turn means we can
   remove the per-node bounce stat as well

 - Improve blk-throttle support

 - Improve delay support for blk-throttle

 - Improve brd discard support

 - Unify IO scheduler switching. This should also fix a bunch of lockdep
   warnings we've been seeing, after enabling lockdep support for queue
   freezing/unfreezeing

 - Add support for block write streams via FDP (flexible data placement)
   on NVMe

 - Add a bunch of block helpers, facilitating the removal of a bunch of
   duplicated boilerplate code

 - Remove obsolete BLK_MQ pci and virtio Kconfig options

 - Add atomic/untorn write support to blktrace

 - Various little cleanups and fixes

* tag 'for-6.16/block-20250523' of git://git.kernel.dk/linux: (186 commits)
  selftests: ublk: add test for UBLK_F_QUIESCE
  ublk: add feature UBLK_F_QUIESCE
  selftests: ublk: add test case for UBLK_U_CMD_UPDATE_SIZE
  traceevent/block: Add REQ_ATOMIC flag to block trace events
  ublk: run auto buf unregisgering in same io_ring_ctx with registering
  io_uring: add helper io_uring_cmd_ctx_handle()
  ublk: remove io argument from ublk_auto_buf_reg_fallback()
  ublk: handle ublk_set_auto_buf_reg() failure correctly in ublk_fetch()
  selftests: ublk: add test for covering UBLK_AUTO_BUF_REG_FALLBACK
  selftests: ublk: support UBLK_F_AUTO_BUF_REG
  ublk: support UBLK_AUTO_BUF_REG_FALLBACK
  ublk: register buffer to local io_uring with provided buf index via UBLK_F_AUTO_BUF_REG
  ublk: prepare for supporting to register request buffer automatically
  ublk: convert to refcount_t
  selftests: ublk: make IO & device removal test more stressful
  nvme: rename nvme_mpath_shutdown_disk to nvme_mpath_remove_disk
  nvme: introduce multipath_always_on module param
  nvme-multipath: introduce delayed removal of the multipath head node
  nvme-pci: derive and better document max segments limits
  nvme-pci: use struct_size for allocation struct nvme_dev
  ...
2025-05-26 11:39:36 -07:00
Mikulas Patocka 050a3e71ce dm mpath: replace spin_lock_irqsave with spin_lock_irq
Replace spin_lock_irqsave/spin_unlock_irqrestore with
spin_lock_irq/spin_unlock_irq at places where it is known that interrupts
are enabled.

Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Benjamin Marzinski <bmarzins@redhat.com>
2025-05-22 15:56:42 +02:00
Benjamin Marzinski 5c977f1023 dm-mpath: Don't grab work_mutex while probing paths
Grabbing the work_mutex keeps probe_active_paths() from running at the
same time as multipath_message(). The only messages that could interfere
with probing the paths are "disable_group", "enable_group", and
"switch_group". These messages could force multipath to pick a new
pathgroup while probe_active_paths() was probing the current pathgroup.
If the multipath device has a hardware handler, and it switches active
pathgroups while there is outstanding IO to a path device, it's possible
that IO to the path will fail, even if the path would be usable if it
was in the active pathgroup. To avoid this, do not clear the current
pathgroup for the *_group messages while probe_active_paths() is
running. Instead set a flag, and probe_active_paths() will clear the
current pathgroup when it finishes probing the paths. For this to work
correctly, multipath needs to check current_pg before next_pg in
choose_pgpath(), but before this patch next_pg was only ever set when
current_pg was cleared, so this doesn't change the current behavior when
paths aren't being probed. Even with this change, it is still possible
to switch pathgroups while the probe is running, but only if all the
paths have failed, and the probe function will skip them as well in this
case.

If multiple DM_MPATH_PROBE_PATHS requests are received at once, there is
no point in repeatedly issuing test IOs. Instead, the later probes
should wait for the current probe to complete. If current pathgroup is
still the same as the one that was just checked, the other probes should
skip probing and just check the number of valid paths.  Finally, probing
the paths should quit early if the multipath device is trying to
suspend, instead of continuing to issue test IOs, delaying the suspend.

While this patch will not change the behavior of existing multipath
users which don't use the DM_MPATH_PROBE_PATHS ioctl, when that ioctl
is used, the behavior of the "disable_group", "enable_group", and
"switch_group" messages can change subtly. When these messages return,
the next IO to the multipath device will no longer be guaranteed to
choose a new pathgroup. Instead, choosing a new pathgroup could be
delayed by an in-progress DM_MPATH_PROBE_PATHS ioctl. The userspace
multipath tools make no assumptions about what will happen to IOs after
sending these messages, so this change will not effect already released
versions of them, even if the DM_MPATH_PROBE_PATHS ioctl is run
alongside them.

Signed-off-by: Benjamin Marzinski <bmarzins@redhat.com>
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
2025-05-16 13:23:45 +02:00
Bart Van Assche 241b9b584d dm-zone: Use bdev_*() helper functions where applicable
Improve code readability by using bdev_is_zone_aligned() and
bdev_offset_from_zone_start() where applicable. No functionality
has been changed.

This patch is a reworked version of a patch from Pankaj Raghav.

See also https://lore.kernel.org/linux-block/20220923173618.6899-11-p.raghav@samsung.com/.

Cc: Damien Le Moal <dlemoal@kernel.org>
Cc: Pankaj Raghav <p.raghav@samsung.com>
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
2025-05-15 15:55:07 +02:00
Matthew Sakai 3da732687d dm vdo indexer: don't read request structure after enqueuing
The function get_volume_page_protected may place a request on
a queue for another thread to process asynchronously. When this
happens, the volume should not read the request from the original
thread. This can not currently cause problems, due to the way
request processing is handled, but it is not safe in general.

Reviewed-by: Ken Raeburn <raeburn@redhat.com>
Signed-off-by: Matthew Sakai <msakai@redhat.com>
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
2025-05-15 15:54:47 +02:00
Fedor Pchelkin f3def8270c sort.h: hoist cmp_int() into generic header file
Deduplicate the same functionality implemented in several places by
moving the cmp_int() helper macro into linux/sort.h.

The macro performs a three-way comparison of the arguments mostly useful
in different sorting strategies and algorithms.

Link: https://lkml.kernel.org/r/20250427201451.900730-1-pchelkin@ispras.ru
Signed-off-by: Fedor Pchelkin <pchelkin@ispras.ru>
Suggested-by: Darrick J. Wong <djwong@kernel.org>
Acked-by: Kent Overstreet <kent.overstreet@linux.dev>
Acked-by: Coly Li <colyli@kernel.org>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Carlos Maiolino <cem@kernel.org>
Cc: Christian Brauner <brauner@kernel.org>
Cc: Coly Li <colyli@kernel.org>
Cc: Jan Kara <jack@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-05-11 17:54:12 -07:00
Yu Kuai 752d0464b7 md: clean up accounting for issued sync IO
It's no longer used and can be removed, also remove the field
'gendisk->sync_io'.

Link: https://lore.kernel.org/linux-raid/20250506124903.2540268-10-yukuai1@huaweicloud.com
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Reviewed-by: Xiao Ni <xni@redhat.com>
2025-05-10 16:14:22 +08:00
Yu Kuai e5797ae703 md: fix is_mddev_idle()
If sync_speed is above speed_min, then is_mddev_idle() will be called
for each sync IO to check if the array is idle, and inflight sync_io
will be limited if the array is not idle.

However, while mkfs.ext4 for a large raid5 array while recovery is in
progress, it's found that sync_speed is already above speed_min while
lots of stripes are used for sync IO, causing long delay for mkfs.ext4.

Root cause is the following checking from is_mddev_idle():

t1: submit sync IO: events1 = completed IO - issued sync IO
t2: submit next sync IO: events2  = completed IO - issued sync IO
if (events2 - events1 > 64)

For consequence, the more sync IO issued, the less likely checking will
pass. And when completed normal IO is more than issued sync IO, the
condition will finally pass and is_mddev_idle() will return false,
however, last_events will be updated hence is_mddev_idle() can only
return false once in a while.

Fix this problem by changing the checking as following:

1) mddev doesn't have normal IO completed;
2) mddev doesn't have normal IO inflight;
3) if any member disks is partition, and all other partitions doesn't
   have IO completed.

Also change rdev->last_events to unsigned long to cleanup type casting.

Link: https://lore.kernel.org/linux-raid/20250506124903.2540268-9-yukuai1@huaweicloud.com
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Reviewed-by: Xiao Ni <xni@redhat.com>
2025-05-10 16:13:31 +08:00
Yu Kuai 03720d82d7 md: add a new api sync_io_depth
Currently if sync speed is above speed_min and below speed_max,
md_do_sync() will wait for all sync IOs to be done before issuing new
sync IO, means sync IO depth is limited to just 1.

This limit is too low, in order to prevent sync speed drop conspicuously
after fixing is_mddev_idle() in the next patch, add a new api for
limiting sync IO depth, the default value is 32.

Link: https://lore.kernel.org/linux-raid/20250506124903.2540268-8-yukuai1@huaweicloud.com
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Reviewed-by: Xiao Ni <xni@redhat.com>
2025-05-10 16:12:52 +08:00
Yu Kuai 7168be3c8a md: record dm-raid gendisk in mddev
Following patch will use gendisk to check if there are normal IO
completed or inflight, to fix a problem in mdraid that foreground IO
can be starved by background sync IO in later patches.

Link: https://lore.kernel.org/linux-raid/20250506124903.2540268-7-yukuai1@huaweicloud.com
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Reviewed-by: Xiao Ni <xni@redhat.com>
2025-05-10 16:12:19 +08:00
Kees Cook 82d76bf938 md/bcache: Mark __nonstring look-up table
GCC 15's new -Wunterminated-string-initialization notices that the 16
character lookup table "zero_uuid" (which is not used as a C-String)
needs to be marked as "nonstring":

drivers/md/bcache/super.c: In function 'uuid_find_empty':
drivers/md/bcache/super.c:549:43: warning: initializer-string for array of 'char' truncates NUL terminator but destination lacks 'nonstring' attribute (17 chars into 16 available) [-Wunterminated-string-initialization]
  549 |         static const char zero_uuid[16] = "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0";
      |                                           ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Add the annotation (since it is not used as a C-String), and switch the
initializer to an array of bytes rather than an empty initializer,
as preferred by Coly Li.

Suggested-by: Coly Li <colyli@kernel.org>
Link: https://lore.kernel.org/lkml/389A9925-0990-422C-A1B3-0195FAA73288@coly.li/
Signed-off-by: Kees Cook <kees@kernel.org>
2025-05-08 09:42:06 -07:00
Christoph Hellwig bd4e709b32 dm-integrity: use bio_add_virt_nofail
Convert the __bio_add_page(..., virt_to_page(), ...) pattern to the
bio_add_virt_nofail helper implementing it, and do the same for the
similar pattern using bio_add_page for adding the first segment after
a bio allocation as that can't fail either.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Acked-by: Mikulas Patocka <mpatocka@redhat.com>
Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Link: https://lore.kernel.org/r/20250507120451.4000627-15-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-05-07 07:31:07 -06:00
Christoph Hellwig 9134124ce1 dm-bufio: use bio_add_virt_nofail
Convert the __bio_add_page(..., virt_to_page(), ...) pattern to the
bio_add_virt_nofail helper implementing it.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Acked-by: Mikulas Patocka <mpatocka@redhat.com>
Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Link: https://lore.kernel.org/r/20250507120451.4000627-14-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-05-07 07:31:07 -06:00
Christoph Hellwig 23f5d69dfa bcache: use bio_add_virt_nofail
Convert the __bio_add_page(..., virt_to_page(), ...) pattern to the
bio_add_virt_nofail helper implementing it.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Acked-by: Coly Li <colyli@kernel.org>
Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Link: https://lore.kernel.org/r/20250507120451.4000627-9-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-05-07 07:31:07 -06:00
Eric Biggers e93912786e dm: pass through operations on wrapped inline crypto keys
Make the device-mapper layer pass through the derive_sw_secret,
import_key, generate_key, and prepare_key blk-crypto operations when all
underlying devices support hardware-wrapped inline crypto keys and are
passing through inline crypto support.

Commit ebc4176551 ("blk-crypto: add basic hardware-wrapped key
support") already made BLK_CRYPTO_KEY_TYPE_HW_WRAPPED be passed through
in the same way that the other crypto capabilities are.  But the wrapped
key support also includes additional operations in blk_crypto_ll_ops,
and the dm layer needs to implement those to pass them through.
derive_sw_secret is needed by fscrypt, while the other operations are
needed for the new blk-crypto ioctls to work on device-mapper devices
and not just the raw partitions.

Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
2025-05-06 19:08:20 +02:00
Linus Torvalds cccd033714 - dm: fix reading past the end of allocated memory
- dm: fix missing dm_put_live_table() in dm_keyslot_evict()
 -----BEGIN PGP SIGNATURE-----
 
 iIoEABYIADIWIQRnH8MwLyZDhyYfesYTAyx9YGnhbQUCaBn8PhQcbXBhdG9ja2FA
 cmVkaGF0LmNvbQAKCRATAyx9YGnhbVCUAQDDMCRu68hiL5SWai9YXhw40rPTuC7k
 e/zHIwRsObItgAD/YvRH1d85XBhQY5x3PCHa3j1u9q+S3uF4naG1n1afqw8=
 =L5lI
 -----END PGP SIGNATURE-----

Merge tag 'for-6.15/dm-fixes-2' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm

Pull device mapper fixes from Mikulas Patocka:

 - fix reading past the end of allocated memory

 - fix missing dm_put_live_table() in dm_keyslot_evict()

* tag 'for-6.15/dm-fixes-2' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm:
  dm: fix copying after src array boundaries
  dm: add missing unlock on in dm_keyslot_evict()
2025-05-06 08:14:20 -07:00
Tudor Ambarus f1aff4bc19 dm: fix copying after src array boundaries
The blammed commit copied to argv the size of the reallocated argv,
instead of the size of the old_argv, thus reading and copying from
past the old_argv allocated memory.

Following BUG_ON was hit:
[    3.038929][    T1] kernel BUG at lib/string_helpers.c:1040!
[    3.039147][    T1] Internal error: Oops - BUG: 00000000f2000800 [#1]  SMP
...
[    3.056489][    T1] Call trace:
[    3.056591][    T1]  __fortify_panic+0x10/0x18 (P)
[    3.056773][    T1]  dm_split_args+0x20c/0x210
[    3.056942][    T1]  dm_table_add_target+0x13c/0x360
[    3.057132][    T1]  table_load+0x110/0x3ac
[    3.057292][    T1]  dm_ctl_ioctl+0x424/0x56c
[    3.057457][    T1]  __arm64_sys_ioctl+0xa8/0xec
[    3.057634][    T1]  invoke_syscall+0x58/0x10c
[    3.057804][    T1]  el0_svc_common+0xa8/0xdc
[    3.057970][    T1]  do_el0_svc+0x1c/0x28
[    3.058123][    T1]  el0_svc+0x50/0xac
[    3.058266][    T1]  el0t_64_sync_handler+0x60/0xc4
[    3.058452][    T1]  el0t_64_sync+0x1b0/0x1b4
[    3.058620][    T1] Code: f800865e a9bf7bfd 910003fd 941f48aa (d4210000)
[    3.058897][    T1] ---[ end trace 0000000000000000 ]---
[    3.059083][    T1] Kernel panic - not syncing: Oops - BUG: Fatal exception

Fix it by copying the size of src, and not the size of dst, as it was.

Fixes: 5a2a6c4281 ("dm: always update the array size in realloc_argv on success")
Cc: stable@vger.kernel.org
Signed-off-by: Tudor Ambarus <tudor.ambarus@linaro.org>
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
2025-05-06 14:06:59 +02:00
John Garry b7c18b17a1 dm-table: Set BLK_FEAT_ATOMIC_WRITES for target queue limits
Feature flag BLK_FEAT_ATOMIC_WRITES is not being properly set for the
target queue limits, and this means that atomic writes are not being
enabled for any dm personalities.

When calling dm_set_device_limits() -> blk_stack_limits() ->
... -> blk_stack_atomic_writes_limits(), the bottom device limits
(which corresponds to intermediate target queue limits) does not have
BLK_FEAT_ATOMIC_WRITES set, and so atomic writes can never be enabled.

Typically such a flag would be inherited from the stacked device in
dm_set_device_limits() -> blk_stack_limits() via BLK_FEAT_INHERIT_MASK,
but BLK_FEAT_ATOMIC_WRITES is not inherited as it's preferred to manually
enable on a per-personality basis.

Set BLK_FEAT_ATOMIC_WRITES manually for the intermediate target queue
limits from the stacked device to get atomic writes working.

Fixes: 3194e36488 ("dm-table: atomic writes support")
Cc: stable@vger.kernel.org	# v6.14
Signed-off-by: John Garry <john.g.garry@oracle.com>
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
2025-05-04 12:02:39 +02:00
Kevin Wolf 7734fb4ad9 dm mpath: Interface for explicit probing of active paths
Multipath cannot directly provide failover for ioctls in the kernel
because it doesn't know what each ioctl means and which result could
indicate a path error. Userspace generally knows what the ioctl it
issued means and if it might be a path error, but neither does it know
which path the ioctl took nor does it necessarily have the privileges to
fail a path using the control device.

In order to allow userspace to address this situation, implement a
DM_MPATH_PROBE_PATHS ioctl that prompts the dm-mpath driver to probe all
active paths in the current path group to see whether they still work,
and fail them if not. If this returns success, userspace can retry the
ioctl and expect that the previously hit bad path is now failed (or
working again).

The immediate motivation for this is the use of SG_IO in QEMU for SCSI
passthrough. Following a failed SG_IO ioctl, QEMU will trigger probing
to ensure that all active paths are actually alive, so that retrying
SG_IO at least has a lower chance of failing due to a path error.
However, the problem is broader than just SG_IO (it affects any ioctl),
and if applications need failover support for other ioctls, the same
probing can be used.

This is not implemented on the DM control device, but on the DM mpath
block devices, to allow all users who have access to such a block device
to make use of this interface, specifically to implement failover for
ioctls. For the same reason, it is also unprivileged. Its implementation
is effectively just a bunch of reads, which could already be issued by
userspace, just without any guarantee that all the rights paths are
selected.

The probing implemented here is done fully synchronously path by path;
probing all paths concurrently is left as an improvement for the future.

Co-developed-by: Hanna Czenczek <hreitz@redhat.com>
Signed-off-by: Hanna Czenczek <hreitz@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Reviewed-by: Benjamin Marzinski <bmarzins@redhat.com>
Signed-off-by: Benjamin Marzinski <bmarzins@redhat.com>
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
2025-05-04 11:35:06 +02:00
Kevin Wolf 4862c8861d dm: Allow .prepare_ioctl to handle ioctls directly
This adds a 'bool *forward' parameter to .prepare_ioctl, which allows
device mapper targets to accept ioctls to themselves instead of the
underlying device. If the target already fully handled the ioctl, it
sets *forward to false and device mapper won't forward it to the
underlying device any more.

In order for targets to actually know what the ioctl is about and how to
handle it, pass also cmd and arg.

As long as targets restrict themselves to interpreting ioctls of type
DM_IOCTL, this is a backwards compatible change because previously, any
such ioctl would have been passed down through all device mapper layers
until it reached a device that can't understand the ioctl and would
return an error.

Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Reviewed-by: Benjamin Marzinski <bmarzins@redhat.com>
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
2025-05-04 11:35:05 +02:00
Benjamin Marzinski 13e79076c8 dm-flakey: make corrupting read bios work
dm-flakey corrupts the read bios in the endio function.  However, the
corrupt_bio_* functions checked bio_has_data() to see if there was data
to corrupt. Since this was the endio function, there was no data left to
complete, so bio_has_data() was always false. Fix this by saving a copy
of the bio's bi_iter in flakey_map(), and using this to initialize the
iter for corrupting the read bios. This patch also skips cloning the bio
for write bios with no data.

Reported-by: Kent Overstreet <kent.overstreet@linux.dev>
Fixes: a3998799fb ("dm flakey: add corrupt_bio_byte feature")
Signed-off-by: Benjamin Marzinski <bmarzins@redhat.com>
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
2025-05-04 11:35:05 +02:00
Benjamin Marzinski 4319f0aaa2 dm-flakey: remove useless ERROR_READS check in flakey_end_io
If ERROR_READS is set, flakey_map returns DM_MAPIO_KILL for read
bios and flakey_end_io is never called, so there's no point in
checking it there. Also clean up an incorrect comment about when
read IOs are errored out.

Signed-off-by: Benjamin Marzinski <bmarzins@redhat.com>
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
2025-05-04 11:35:05 +02:00
Benjamin Marzinski 40ed054f39 dm-flakey: error all IOs when num_features is absent
dm-flakey would error all IOs if num_features was 0, but if it was
absent, dm-flakey would never error any IO. Fix this so that no
num_features works the same as num_features set to 0.

Fixes: aa7d7bc99f ("dm flakey: add an "error_reads" option")
Reported-by: Kent Overstreet <kent.overstreet@linux.dev>
Signed-off-by: Benjamin Marzinski <bmarzins@redhat.com>
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
2025-05-04 11:35:05 +02:00
Benjamin Marzinski 19da6b2c9e dm-flakey: Clean up parsing messages
There were a number of cases where the error message for an invalid
table line did not match the actual problem. Fix these. Additionally,
error out when duplicate corrupt_bio_byte, random_read_corrupt, or
random_write_corrupt features are present. Also, error_reads is
incompatible with random_read_corrupt and corrupt_bio_byte with the READ
flag set, so disallow that.

Reported-by: Kent Overstreet <kent.overstreet@linux.dev>
Signed-off-by: Benjamin Marzinski <bmarzins@redhat.com>
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
2025-05-04 11:35:05 +02:00
Benjamin Marzinski d90e7a500c dm: remove unneeded kvfree from alloc_targets
alloc_targets() is always called with a newly initialized table where
t->highs == NULL.

Signed-off-by: Benjamin Marzinski <bmarzins@redhat.com>
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
2025-05-04 11:35:05 +02:00
Eric Biggers 9769378133 dm-bufio: remove maximum age based eviction
Every 30 seconds, dm-bufio evicts all buffers that were not accessed
within the last max_age_seconds, except those pinned in memory via
retain_bytes.  By default max_age_seconds is 300 (i.e. 5 minutes), and
retain_bytes is 262144 (i.e. 256 KiB) per dm-bufio client.

This eviction algorithm is much too eager and is also redundant with the
shinker based eviction.

Testing on an Android phone shows that about 30 MB of dm-bufio buffers
(from dm-verity Merkle tree blocks) are loaded at boot time, and then
about 90% of them are suddenly thrown away 5 minutes after boot.  This
results in unnecessary Merkle tree I/O later.

Meanwhile, if the system actually encounters memory pressure, testing
also shows that the shrinker is effective at evicting the buffers.

Other major Linux kernel caches, such as the page cache, do not enforce
a maximum age, instead relying on the shrinker.

For these reasons, Android is now setting max_age_seconds to 86400
(i.e. 1 day), which mostly disables it; see
cadad290a7%5E%21/

That is a much better default, but really the maximum age based eviction
should not exist at all.  Let's remove it.

Note that this also eliminates the need to run work every 30 seconds,
which is beneficial too.

Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
2025-05-04 11:35:05 +02:00
Eric Biggers f9ed31214e dm-verity: use softirq context only when !need_resched()
Further limit verification in softirq (a.k.a. BH) context to cases where
rescheduling of the interrupted task is not pending.

This helps prevent the CPU from spending too long in softirq context.

Note that handle_softirqs() in kernel/softirq.c already stops running
softirqs in this same case.  However, that check is too coarse-grained,
since many I/O requests can be processed in a single BLOCK_SOFTIRQ.

Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
2025-05-04 11:35:05 +02:00
Mikulas Patocka abb4cf2f4c dm: lock limits when reading them
Lock queue limits when reading them, so that we don't read halfway
modified values.

Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Cc: stable@vger.kernel.org
2025-05-04 11:35:05 +02:00
Mikulas Patocka f1e24048ed dm: use generic functions instead of disable_discard and disable_write_zeroes
A small code cleanup: use blk_queue_disable_discard and
blk_queue_disable_write_zeroes instead of disable_discard and
disable_write_zeroes.

Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
2025-05-04 11:35:05 +02:00
Benjamin Marzinski 33304b75df dm-delay: don't busy-wait in kthread
When using a kthread to delay the IOs, dm-delay would continuously loop,
checking if IOs were ready to submit. It had a cond_resched() call in
the loop, but might still loop hundreds of millions of times waiting for
an IO that was scheduled to be submitted 10s of ms in the future. With
the change to make dm-delay over zoned devices always use kthreads
regardless of the length of the delay, this wasted work only gets worse.

To solve this and still keep roughly the same precision for very short
delays, dm-delay now calls fsleep() for 1/8th of the smallest non-zero
delay it will place on IOs, or 1 ms, whichever is smaller. The reason
that dm-delay doesn't just use the actual expiration time of the next
delayed IO to calculated the sleep time is that delay_dtr() must wait
for the kthread to finish before deleting the table. If a zoned device
with a long delay queued an IO shortly before being suspended and
removed, the IO would be flushed in delay_presuspend(), but the removing
the device would still have to wait for the remainder of the long delay.
This time is now capped at 1 ms.

Signed-off-by: Benjamin Marzinski <bmarzins@redhat.com>
Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
Tested-by: Damien Le Moal <dlemoal@kernel.org>
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
2025-05-04 11:35:05 +02:00
Benjamin Marzinski ad320ae276 dm: fix native zone append devices on top of emulated ones
If a DM device that can pass down zone append commands is stacked on top
of a device that emulates zone append commands, it will allocate zone
append emulation resources, even though it doesn't use them. This is
because the underlying device will have max_hw_zone_append_sectors set
to 0 to request zone append emulation. When the DM device is stacked on
top of it, it will inherit that max_hw_zone_append_sectors limit,
despite being able to pass down zone append bios. Solve this by making
sure max_hw_zone_append_sectors is non-zero for DM devices that do not
need zone append emulation.

Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
Tested-by: Damien Le Moal <dlemoal@kernel.org>
Signed-off-by: Benjamin Marzinski <bmarzins@redhat.com>
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
2025-05-04 11:35:05 +02:00
Benjamin Marzinski 121218bef4 dm: limit swapping tables for devices with zone write plugs
dm_revalidate_zones() only allowed new or previously unzoned devices to
call blk_revalidate_disk_zones(). If the device was already zoned,
disk->nr_zones would always equal md->nr_zones, so dm_revalidate_zones()
returned without doing any work. This would make the zoned settings for
the device not match the new table. If the device had zone write plug
resources, it could run into errors like bdev_zone_is_seq() reading
invalid memory because disk->conv_zones_bitmap was the wrong size.

If the device doesn't have any zone write plug resources, calling
blk_revalidate_disk_zones() will always correctly update device.  If
blk_revalidate_disk_zones() fails, it can still overwrite or clear the
current disk->nr_zones value. In this case, DM must restore the previous
value of disk->nr_zones, so that the zoned settings will continue to
match the previous value that it fell back to.

If the device already has zone write plug resources,
blk_revalidate_disk_zones() will not correctly update them, if it is
called for arbitrary zoned device changes.  Since there is not much need
for this ability, the easiest solution is to disallow any table reloads
that change the zoned settings, for devices that already have zone plug
resources.  Specifically, if a device already has zone plug resources
allocated, it can only switch to another zoned table that also emulates
zone append.  Also, it cannot change the device size or the zone size. A
device can switch to an error target.

Fixes: bb37d77239 ("dm: introduce zone append emulation")
Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
Tested-by: Damien Le Moal <dlemoal@kernel.org>
Signed-off-by: Benjamin Marzinski <bmarzins@redhat.com>
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
2025-05-04 11:35:05 +02:00
Benjamin Marzinski 37f53a2c60 dm: fix dm_blk_report_zones
If dm_get_live_table() returned NULL, dm_put_live_table() was never
called. Also, it is possible that md->zone_revalidate_map will change
while calling this function. Only read it once, so that we are always
using the same value. Otherwise we might miss a call to
dm_put_live_table().

Finally, while md->zone_revalidate_map is set and a process is calling
blk_revalidate_disk_zones() to set up the zone append emulation
resources, it is possible that another process, perhaps triggered by
blkdev_report_zones_ioctl(), will call dm_blk_report_zones(). If
blk_revalidate_disk_zones() fails, these resources can be freed while
the other process is still using them, causing a use-after-free error.

blk_revalidate_disk_zones() will only ever be called when initially
setting up the zone append emulation resources, such as when setting up
a zoned dm-crypt table for the first time. Further table swaps will not
set md->zone_revalidate_map or call blk_revalidate_disk_zones().
However it must be called using the new table (referenced by
md->zone_revalidate_map) and the new queue limits while the DM device is
suspended. dm_blk_report_zones() needs some way to distinguish between a
call from blk_revalidate_disk_zones(), which must be allowed to use
md->zone_revalidate_map to access this not yet activated table, and all
other calls to dm_blk_report_zones(), which should not be allowed while
the device is suspended and cannot use md->zone_revalidate_map, since
the zone resources might be freed by the process currently calling
blk_revalidate_disk_zones().

Solve this by tracking the process that sets md->zone_revalidate_map in
dm_revalidate_zones() and only allowing that process to make use of it
in dm_blk_report_zones().

Fixes: f211268ed1 ("dm: Use the block layer zone append emulation")
Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
Tested-by: Damien Le Moal <dlemoal@kernel.org>
Signed-off-by: Benjamin Marzinski <bmarzins@redhat.com>
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
2025-05-04 11:34:56 +02:00
Dan Carpenter 650266ac4c dm: add missing unlock on in dm_keyslot_evict()
We need to call dm_put_live_table() even if dm_get_live_table() returns
NULL.

Fixes: 9355a9eb21 ("dm: support key eviction from keyslot managers of underlying devices")
Cc: stable@vger.kernel.org	# v5.12+
Signed-off-by: Dan Carpenter <dan.carpenter@linaro.org>
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
2025-04-30 18:17:43 +02:00
Linus Torvalds 78109c591b - dm: always update the array size in realloc_argv on success
- dm-integrity: fix a warning on invalid table line
 
 - dm-bufio: don't schedule in atomic context
 
 - dm: Fix W=1 build with clang
 -----BEGIN PGP SIGNATURE-----
 
 iIoEABYIADIWIQRnH8MwLyZDhyYfesYTAyx9YGnhbQUCaA+jNxQcbXBhdG9ja2FA
 cmVkaGF0LmNvbQAKCRATAyx9YGnhbUzpAP0Q/RPWwWg/x31Qj119bDDkgEFBTrZd
 PqVPJzKsnZgTTQEA4AaxwgVz4pYDknrPCmHHGYKrLI0QxoVv4s7GOMrA9wc=
 =ze9q
 -----END PGP SIGNATURE-----

Merge tag 'for-6.15/dm-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm

Pull device mapper fixes from Mikulas Patocka:

 - always update the array size in realloc_argv on success

 - dm-integrity: fix a warning on invalid table line

 - dm-bufio: don't schedule in atomic context

 - Fix W=1 build with clang

* tag 'for-6.15/dm-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm:
  dm: always update the array size in realloc_argv on success
  dm-integrity: fix a warning on invalid table line
  dm-bufio: don't schedule in atomic context
  dm table: Fix W=1 build warning when mempool_needs_integrity is unused
2025-04-28 12:18:21 -07:00
Benjamin Marzinski 5a2a6c4281 dm: always update the array size in realloc_argv on success
realloc_argv() was only updating the array size if it was called with
old_argv already allocated. The first time it was called to create an
argv array, it would allocate the array but return the array size as
zero. dm_split_args() would think that it couldn't store any arguments
in the array and would call realloc_argv() again, causing it to
reallocate the initial slots (this time using GPF_KERNEL) and finally
return a size. Aside from being wasteful, this could cause deadlocks on
targets that need to process messages without starting new IO. Instead,
realloc_argv should always update the allocated array size on success.

Fixes: a065192655 ("dm table: don't copy from a NULL pointer in realloc_argv()")
Cc: stable@vger.kernel.org
Signed-off-by: Benjamin Marzinski <bmarzins@redhat.com>
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
2025-04-28 13:11:33 +02:00
Mikulas Patocka 0a533c3e42 dm-integrity: fix a warning on invalid table line
If we use the 'B' mode and we have an invalit table line,
cancel_delayed_work_sync would trigger a warning. This commit avoids the
warning.

Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Cc: stable@vger.kernel.org
2025-04-23 13:09:15 +02:00
LongPing Wei a3d8f0a7f5 dm-bufio: don't schedule in atomic context
A BUG was reported as below when CONFIG_DEBUG_ATOMIC_SLEEP and
try_verify_in_tasklet are enabled.
[  129.444685][  T934] BUG: sleeping function called from invalid context at drivers/md/dm-bufio.c:2421
[  129.444723][  T934] in_atomic(): 1, irqs_disabled(): 0, non_block: 0, pid: 934, name: kworker/1:4
[  129.444740][  T934] preempt_count: 201, expected: 0
[  129.444756][  T934] RCU nest depth: 0, expected: 0
[  129.444781][  T934] Preemption disabled at:
[  129.444789][  T934] [<ffffffd816231900>] shrink_work+0x21c/0x248
[  129.445167][  T934] kernel BUG at kernel/sched/walt/walt_debug.c:16!
[  129.445183][  T934] Internal error: Oops - BUG: 00000000f2000800 [#1] PREEMPT SMP
[  129.445204][  T934] Skip md ftrace buffer dump for: 0x1609e0
[  129.447348][  T934] CPU: 1 PID: 934 Comm: kworker/1:4 Tainted: G        W  OE      6.6.56-android15-8-o-g6f82312b30b9-debug #1 1400000003000000474e5500b3187743670464e8
[  129.447362][  T934] Hardware name: Qualcomm Technologies, Inc. Parrot QRD, Alpha-M (DT)
[  129.447373][  T934] Workqueue: dm_bufio_cache shrink_work
[  129.447394][  T934] pstate: 60400005 (nZCv daif +PAN -UAO -TCO -DIT -SSBS BTYPE=--)
[  129.447406][  T934] pc : android_rvh_schedule_bug+0x0/0x8 [sched_walt_debug]
[  129.447435][  T934] lr : __traceiter_android_rvh_schedule_bug+0x44/0x6c
[  129.447451][  T934] sp : ffffffc0843dbc90
[  129.447459][  T934] x29: ffffffc0843dbc90 x28: ffffffffffffffff x27: 0000000000000c8b
[  129.447479][  T934] x26: 0000000000000040 x25: ffffff804b3d6260 x24: ffffffd816232b68
[  129.447497][  T934] x23: ffffff805171c5b4 x22: 0000000000000000 x21: ffffffd816231900
[  129.447517][  T934] x20: ffffff80306ba898 x19: 0000000000000000 x18: ffffffc084159030
[  129.447535][  T934] x17: 00000000d2b5dd1f x16: 00000000d2b5dd1f x15: ffffffd816720358
[  129.447554][  T934] x14: 0000000000000004 x13: ffffff89ef978000 x12: 0000000000000003
[  129.447572][  T934] x11: ffffffd817a823c4 x10: 0000000000000202 x9 : 7e779c5735de9400
[  129.447591][  T934] x8 : ffffffd81560d004 x7 : 205b5d3938373434 x6 : ffffffd8167397c8
[  129.447610][  T934] x5 : 0000000000000000 x4 : 0000000000000001 x3 : ffffffc0843db9e0
[  129.447629][  T934] x2 : 0000000000002f15 x1 : 0000000000000000 x0 : 0000000000000000
[  129.447647][  T934] Call trace:
[  129.447655][  T934]  android_rvh_schedule_bug+0x0/0x8 [sched_walt_debug 1400000003000000474e550080cce8a8a78606b6]
[  129.447681][  T934]  __might_resched+0x190/0x1a8
[  129.447694][  T934]  shrink_work+0x180/0x248
[  129.447706][  T934]  process_one_work+0x260/0x624
[  129.447718][  T934]  worker_thread+0x28c/0x454
[  129.447729][  T934]  kthread+0x118/0x158
[  129.447742][  T934]  ret_from_fork+0x10/0x20
[  129.447761][  T934] Code: ???????? ???????? ???????? d2b5dd1f (d4210000)
[  129.447772][  T934] ---[ end trace 0000000000000000 ]---

dm_bufio_lock will call spin_lock_bh when try_verify_in_tasklet
is enabled, and __scan will be called in atomic context.

Fixes: 7cd326747f ("dm bufio: remove dm_bufio_cond_resched()")
Signed-off-by: LongPing Wei <weilongping@oppo.com>
Cc: stable@vger.kernel.org
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
2025-04-23 13:09:10 +02:00
Linus Torvalds be913e7c40 gcc-15: get rid of misc extra NUL character padding
This removes two cases of explicit NUL padding that now causes warnings
because of '-Wunterminated-string-initialization' being part of -Wextra
in gcc-15.

Gcc is being silly in this case when it says that it truncates a NUL
terminator, because in these cases there were _multiple_ NUL characters.

But we can get rid of the warning by just simplifying the two
initializers that trigger the warning for me, so this does exactly that.

I'm not sure why the power supply code did that odd

    .attr_name = #_name "\0",

pattern: it was introduced in commit 2cabeaf151 ("power: supply: core:
Cleanup power supply sysfs attribute list"), but that 'attr_name[]'
field is an explicitly sized character array in a statically initialized
variable, and a string initializer always has a terminating NUL _and_
statically initialized character arrays are zero-padded anyway, so it
really seems to be rather extraneous belt-and-suspenders.

The zero_uuid[16] initialization in drivers/md/bcache/super.c makes
perfect sense, but it isn't necessary for the same reasons, and not
worth the new gcc warning noise.

Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2025-04-20 11:57:54 -07:00
Linus Torvalds f7c2ca2584 block-6.15-20250417
-----BEGIN PGP SIGNATURE-----
 
 iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmgBYN8QHGF4Ym9lQGtl
 cm5lbC5kawAKCRD301j7KXHgprhiD/9LnoTEqPdDGaE/f/1yUgYcJL8IUSMTfpyB
 JajQp5klWNHmIyD/GdzFq+6SL/XDAllO/NgVdQlI+78s5GRn7A/fy3unmB3kYhs/
 Spz9reD7/wH6lp2/u5jKD0Dk3Wz9LCGAUxQ2QYtd9lXJo/Dem3roBpty6/GYvLQy
 3kSwa3e4dekd9jBZ+lbnSaFcvQg3Xc/x+SoP3r60wrMIEOyJrHLHWhLMolC/ZkGw
 sl1nvj4dnRAK77G7KPctYIu6K7ryJwQLJhBre7t5Fd4Dzn46l/sNwOkBn7hhdaTR
 e3+F7C1D22zIHFrknkm1+9KkZA/9tIz1CUYRlYCxGPsH1XP4dy78uTk7sgGORV9C
 0gFJ3nqzSu0LP3Mk06e2DH+Oqq0wtdnggxmAXjJhah9JFrP7H9bEi4lTEsJ6XjLV
 PCL4PYGEkrJp7faD0p2icq6GKwx/EINlCob6Cx0h+lNo/Crz0FjkPNZkLTiYLahc
 S8Wlc6xMiMtRxdH3LX8ptAGot2s3uTQiNIKmkPkzIiSDoUoZbao1oknm8tpmXa1x
 Wg6bmOj5Jbd1K+Gyu24rIxW7RVkXtfB63o5ScRu+YGXhulsnV2mCPXZ2qxlW3s51
 zZcHUNQPAgmBdf/qzkNbk4fPS2G1rC6eJOLn84B4E5PWbP0xFjv6FdEwPF/ovdb8
 aIyR3vSjyA==
 =YCi8
 -----END PGP SIGNATURE-----

Merge tag 'block-6.15-20250417' of git://git.kernel.dk/linux

Pull block fixes from Jens Axboe:

 - MD pull via Yu:
      - fix raid10 missing discard IO accounting (Yu Kuai)
      - fix bitmap stats for bitmap file (Zheng Qixing)
      - fix oops while reading all member disks failed during
        check/repair (Meir Elisha)

 - NVMe pull via Christoph:
      - fix scan failure for non-ANA multipath controllers (Hannes
        Reinecke)
      - fix multipath sysfs links creation for some cases (Hannes
        Reinecke)
      - PCIe endpoint fixes (Damien Le Moal)
      - use NULL instead of 0 in the auth code (Damien Le Moal)

 - Various ublk fixes:
      - Slew of selftest additions
      - Improvements and fixes for IO cancelation
      - Tweak to Kconfig verbiage

 - Fix for page dirtying for blk integrity mapped pages

 - loop fixes:
      - buffered IO fix
      - uevent fixes
      - request priority inheritance fix

 - Various little fixes

* tag 'block-6.15-20250417' of git://git.kernel.dk/linux: (38 commits)
  selftests: ublk: add generic_06 for covering fault inject
  ublk: simplify aborting ublk request
  ublk: remove __ublk_quiesce_dev()
  ublk: improve detection and handling of ublk server exit
  ublk: move device reset into ublk_ch_release()
  ublk: rely on ->canceling for dealing with ublk_nosrv_dev_should_queue_io
  ublk: add ublk_force_abort_dev()
  ublk: properly serialize all FETCH_REQs
  selftests: ublk: move creating UBLK_TMP into _prep_test()
  selftests: ublk: add test_stress_05.sh
  selftests: ublk: support user recovery
  selftests: ublk: support target specific command line
  selftests: ublk: increase max nr_queues and queue depth
  selftests: ublk: set queue pthread's cpu affinity
  selftests: ublk: setup ring with IORING_SETUP_SINGLE_ISSUER/IORING_SETUP_DEFER_TASKRUN
  selftests: ublk: add two stress tests for zero copy feature
  selftests: ublk: run stress tests in parallel
  selftests: ublk: make sure _add_ublk_dev can return in sub-shell
  selftests: ublk: cleanup backfile automatically
  selftests: ublk: add io_uring uapi header
  ...
2025-04-18 09:21:14 -07:00
Meir Elisha b7c178d9e5 md/raid1: Add check for missing source disk in process_checks()
During recovery/check operations, the process_checks function loops
through available disks to find a 'primary' source with successfully
read data.

If no suitable source disk is found after checking all possibilities,
the 'primary' index will reach conf->raid_disks * 2. Add an explicit
check for this condition after the loop. If no source disk was found,
print an error message and return early to prevent further processing
without a valid primary source.

Link: https://lore.kernel.org/linux-raid/20250408143808.1026534-1-meir.elisha@volumez.com
Signed-off-by: Meir Elisha <meir.elisha@volumez.com>
Suggested-and-reviewed-by: Yu Kuai <yukuai3@huawei.com>
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
2025-04-16 18:06:37 +08:00
Benjamin Marzinski 4ea30ec6fb dm: handle failures in dm_table_set_restrictions
If dm_table_set_restrictions() fails while swapping tables,
device-mapper will continue using the previous table. It must be sure to
leave the mapped_device in it's previous state on failure.  Otherwise
device-mapper could end up using the old table with settings from the
unused table.

Do not update the mapped device in dm_set_zones_restrictions(). Wait
till after dm_table_set_restrictions() is sure to succeed to update the
md zoned settings. Do the same with the dax settings, and if
dm_revalidate_zones() fails, restore the original queue limits.

Fixes: 7f91ccd8a6 ("dm: Call dm_revalidate_zones() after setting the queue limits")
Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
Tested-by: Damien Le Moal <dlemoal@kernel.org>
Signed-off-by: Benjamin Marzinski <bmarzins@redhat.com>
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
2025-04-11 13:38:53 +02:00
Benjamin Marzinski e8819e7f03 dm: free table mempools if not used in __bind
With request-based dm, the mempools don't need reloading when switching
tables, but the unused table mempools are not freed until the active
table is finally freed. Free them immediately if they are not needed.

Fixes: 29dec90a0f ("dm: fix bio_set allocation")
Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
Tested-by: Damien Le Moal <dlemoal@kernel.org>
Signed-off-by: Benjamin Marzinski <bmarzins@redhat.com>
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
2025-04-11 13:38:50 +02:00
Benjamin Marzinski 9eb7109a5b dm: don't change md if dm_table_set_restrictions() fails
__bind was changing the disk capacity, geometry and mempools of the
mapped device before calling dm_table_set_restrictions() which could
fail, forcing dm to drop the new table. Failing here would leave the
device using the old table but with the wrong capacity and mempools.

Move dm_table_set_restrictions() earlier in __bind(). Since it needs the
capacity to be set, save the old version and restore it on failure.

Fixes: bb37d77239 ("dm: introduce zone append emulation")
Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
Tested-by: Damien Le Moal <dlemoal@kernel.org>
Signed-off-by: Benjamin Marzinski <bmarzins@redhat.com>
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
2025-04-11 13:38:39 +02:00
Andy Shevchenko 7bd47be161 dm table: Fix W=1 build warning when mempool_needs_integrity is unused
The mempool_needs_integrity is unused. This, in particular, prevents
kernel builds with Clang, `make W=1` and CONFIG_WERROR=y:

drivers/md/dm-table.c:1052:7: error: variable 'mempool_needs_integrity' set but not used [-Werror,-Wunused-but-set-variable]
 1052 |         bool mempool_needs_integrity = t->integrity_supported;
      |              ^

Fix this by removing the leftover.

Fixes: 105ca2a2c2 ("block: split struct bio_integrity_payload")
Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
2025-04-09 15:56:59 +02:00
Linus Torvalds 97c484ccb8 CRC cleanups for 6.15
Finish cleaning up the CRC kconfig options by removing the remaining
 unnecessary prompts and an unnecessary 'default y', removing
 CONFIG_LIBCRC32C, and documenting all the CRC library options.
 -----BEGIN PGP SIGNATURE-----
 
 iIoEABYIADIWIQSacvsUNc7UX4ntmEPzXCl4vpKOKwUCZ/P7QhQcZWJpZ2dlcnNA
 Z29vZ2xlLmNvbQAKCRDzXCl4vpKOKyoOAQCynFcS1dWuD27S+SdUREmBjMAoZo5M
 zdsIvlPv9KLycgD/QX5lXjW3KIYY6jQ8vHUuLVwfDl/JEp4GJS9dLGU+agg=
 =0R1T
 -----END PGP SIGNATURE-----

Merge tag 'crc-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiggers/linux

Pull CRC cleanups from Eric Biggers:
 "Finish cleaning up the CRC kconfig options by removing the remaining
  unnecessary prompts and an unnecessary 'default y', removing
  CONFIG_LIBCRC32C, and documenting all the CRC library options"

* tag 'crc-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiggers/linux:
  lib/crc: remove CONFIG_LIBCRC32C
  lib/crc: document all the CRC library kconfig options
  lib/crc: remove unnecessary prompt for CONFIG_CRC_ITU_T
  lib/crc: remove unnecessary prompt for CONFIG_CRC_T10DIF
  lib/crc: remove unnecessary prompt for CONFIG_CRC16
  lib/crc: remove unnecessary prompt for CONFIG_CRC_CCITT
  lib/crc: remove unnecessary prompt for CONFIG_CRC32 and drop 'default y'
2025-04-08 12:09:28 -07:00
Zheng Qixing 6ec1f02394 md/md-bitmap: fix stats collection for external bitmaps
The bitmap_get_stats() function incorrectly returns -ENOENT for external
bitmaps.

Remove the external bitmap check as the statistics should be available
regardless of bitmap storage location.

Return -EINVAL only for invalid bitmap with no storage (neither in
superblock nor in external file).

Note: "bitmap_info.external" here refers to a bitmap stored in a separate
file (bitmap_file), not to external metadata.

Fixes: 8d28d0ddb9 ("md/md-bitmap: Synchronize bitmap_get_stats() with bitmap lifetime")
Signed-off-by: Zheng Qixing <zhengqixing@huawei.com>
Link: https://lore.kernel.org/linux-raid/20250403015322.2873369-1-zhengqixing@huaweicloud.com
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
2025-04-06 12:55:13 +08:00
Yu Kuai d05af90d62 md/raid10: fix missing discard IO accounting
md_account_bio() is not called from raid10_handle_discard(), now that we
handle bitmap inside md_account_bio(), also fix missing
bitmap_startwrite for discard.

Test whole disk discard for 20G raid10:

Before:
Device   d/s     dMB/s   drqm/s  %drqm d_await dareq-sz
md0    48.00     16.00     0.00   0.00    5.42   341.33

After:
Device   d/s     dMB/s   drqm/s  %drqm d_await dareq-sz
md0    68.00  20462.00     0.00   0.00    2.65 308133.65

Link: https://lore.kernel.org/linux-raid/20250325015746.3195035-1-yukuai1@huaweicloud.com
Fixes: 528bc2cf2f ("md/raid10: enable io accounting")
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Acked-by: Coly Li <colyli@kernel.org>
2025-04-06 12:53:12 +08:00
Thomas Gleixner 8fa7292fee treewide: Switch/rename to timer_delete[_sync]()
timer_delete[_sync]() replaces del_timer[_sync](). Convert the whole tree
over and remove the historical wrapper inlines.

Conversion was done with coccinelle plus manual fixups where necessary.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2025-04-05 10:30:12 +02:00
Eric Biggers b261d22220 lib/crc: remove CONFIG_LIBCRC32C
Now that LIBCRC32C does nothing besides select CRC32, make every option
that selects LIBCRC32C instead select CRC32 directly.  Then remove
LIBCRC32C.

Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: "Martin K. Petersen" <martin.petersen@oracle.com>
Acked-by: Ard Biesheuvel <ardb@kernel.org>
Link: https://lore.kernel.org/r/20250401221600.24878-8-ebiggers@kernel.org
Signed-off-by: Eric Biggers <ebiggers@google.com>
2025-04-04 11:31:42 -07:00
Linus Torvalds 5014bebee0 - dm-crypt: switch to using the crc32 library
- dm-verity, dm-integrity, dm-crypt: documentation improvement
 
 - dm-vdo fixes
 
 - dm-stripe: enable inline crypto passthrough
 
 - dm-integrity: set ti->error on memory allocation failure
 
 - dm-bufio: remove unused return value
 
 - dm-verity: do forward error correction on metadata I/O errors
 
 - dm: fix unconditional IO throttle caused by REQ_PREFLUSH
 
 - dm cache: prevent BUG_ON by blocking retries on failed device resumes
 
 - dm cache: support shrinking the origin device
 
 - dm: restrict dm device size to 2^63-512 bytes
 
 - dm-delay: support zoned devices
 
 - dm-verity: support block number limits for different ioprio classes
 
 - dm-integrity: fix non-constant-time tag verification (security bug)
 
 - dm-verity, dm-ebs: fix prefetch-vs-suspend race
 -----BEGIN PGP SIGNATURE-----
 
 iIoEABYIADIWIQRnH8MwLyZDhyYfesYTAyx9YGnhbQUCZ+u7shQcbXBhdG9ja2FA
 cmVkaGF0LmNvbQAKCRATAyx9YGnhbZ0JAQDVhbl77u9jjPWjxJvFodMAqw+KPXGC
 MNzkyzG0lu7oPAEA33vt5pHQtr7F3SJj/sDBuZ+rb5bvUtgxeGqpJOQpTAk=
 =tj00
 -----END PGP SIGNATURE-----

Merge tag 'for-6.15/dm-changes' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm

Pull device mapper updates from Mikulas Patocka:

 - dm-crypt: switch to using the crc32 library

 - dm-verity, dm-integrity, dm-crypt: documentation improvement

 - dm-vdo fixes

 - dm-stripe: enable inline crypto passthrough

 - dm-integrity: set ti->error on memory allocation failure

 - dm-bufio: remove unused return value

 - dm-verity: do forward error correction on metadata I/O errors

 - dm: fix unconditional IO throttle caused by REQ_PREFLUSH

 - dm cache: prevent BUG_ON by blocking retries on failed device resumes

 - dm cache: support shrinking the origin device

 - dm: restrict dm device size to 2^63-512 bytes

 - dm-delay: support zoned devices

 - dm-verity: support block number limits for different ioprio classes

 - dm-integrity: fix non-constant-time tag verification (security bug)

 - dm-verity, dm-ebs: fix prefetch-vs-suspend race

* tag 'for-6.15/dm-changes' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm: (27 commits)
  dm-ebs: fix prefetch-vs-suspend race
  dm-verity: fix prefetch-vs-suspend race
  dm-integrity: fix non-constant-time tag verification
  dm-verity: support block number limits for different ioprio classes
  dm-delay: support zoned devices
  dm: restrict dm device size to 2^63-512 bytes
  dm cache: support shrinking the origin device
  dm cache: prevent BUG_ON by blocking retries on failed device resumes
  dm vdo indexer: reorder uds_request to reduce padding
  dm: fix unconditional IO throttle caused by REQ_PREFLUSH
  dm vdo: rework processing of loaded refcount byte arrays
  dm vdo: remove remaining ring references
  dm-verity: do forward error correction on metadata I/O errors
  dm-bufio: remove unused return value
  dm-integrity: set ti->error on memory allocation failure
  dm: Enable inline crypto passthrough for striped target
  dm vdo slab-depot: read refcount blocks in large chunks at load time
  dm vdo vio-pool: allow variable-sized metadata vios
  dm vdo vio-pool: support pools with multiple data blocks per vio
  dm vdo vio-pool: add a pool pointer to pooled_vio
  ...
2025-04-02 21:27:59 -07:00
Mikulas Patocka 9c56542878 dm-ebs: fix prefetch-vs-suspend race
There's a possible race condition in dm-ebs - dm bufio prefetch may be in
progress while the device is suspended. Fix this by calling
dm_bufio_client_reset in the postsuspend hook.

Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Cc: stable@vger.kernel.org
2025-03-28 18:26:11 +01:00
Mikulas Patocka 2de510fccb dm-verity: fix prefetch-vs-suspend race
There's a possible race condition in dm-verity - the prefetch work item
may race with suspend and it is possible that prefetch continues to run
while the device is suspended. Fix this by calling flush_workqueue and
dm_bufio_client_reset in the postsuspend hook.

Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Cc: stable@vger.kernel.org
2025-03-28 18:26:11 +01:00
Jo Van Bulck 8bde1033f9 dm-integrity: fix non-constant-time tag verification
When using dm-integrity in standalone mode with a keyed hmac algorithm,
integrity tags are calculated and verified internally.

Using plain memcmp to compare the stored and computed tags may leak the
position of the first byte mismatch through side-channel analysis,
allowing to brute-force expected tags in linear time (e.g., by counting
single-stepping interrupts in confidential virtual machine environments).

Co-developed-by: Luca Wilke <work@luca-wilke.com>
Signed-off-by: Luca Wilke <work@luca-wilke.com>
Signed-off-by: Jo Van Bulck <jo.vanbulck@cs.kuleuven.be>
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Cc: stable@vger.kernel.org
2025-03-28 18:25:42 +01:00
LongPing Wei 5c5d0d7050 dm-verity: support block number limits for different ioprio classes
Calling verity_verify_io in bh for IO of all sizes is not suitable for
embedded devices. From our tests, it can improve the performance of 4K
synchronise random reads.
For example:
./fio --name=rand_read --ioengine=psync --rw=randread --bs=4K \
 --direct=1 --numjobs=8 --runtime=60 --time_based --group_reporting \
 --filename=/dev/block/mapper/xx-verity

But it will degrade the performance of 512K synchronise sequential reads
on our devices.
For example:
./fio --name=read --ioengine=psync --rw=read --bs=512K --direct=1 \
 --numjobs=8 --runtime=60 --time_based --group_reporting \
 --filename=/dev/block/mapper/xx-verity

A parameter array is introduced by this change. And users can modify the
default config by /sys/module/dm_verity/parameters/use_bh_bytes.

The default limits for NONE/RT/BE is set to 8192.
The default limits for IDLE is set to 0.

Call verity_verify_io directly when verity_end_io is not in hardirq.

Signed-off-by: LongPing Wei <weilongping@oppo.com>
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
2025-03-28 11:32:55 +01:00
Linus Torvalds 9b960d8cd6 for-6.15/block-20250322
-----BEGIN PGP SIGNATURE-----
 
 iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmfe8BkQHGF4Ym9lQGtl
 cm5lbC5kawAKCRD301j7KXHgpvTqD/4pOeGi/QfLyocn4TcJcidRGZAvBxecTVuM
 upeyr+dCyCi9Wk+EJKeAFooGe15upzxDxKj06HhCixaLx4etDK78uGV4FMM1Z4oa
 2dtchz1Zd0HyBPgQIUY8OuOgbS7tstMS/KdvL+gr5IjfapeTF+54WVLCD8eVyvO/
 vUIppgJBhrqy2qui4xF2lw4t2COt+/PqinGQuYALn4V4Po9NWA7lSh3ZI4F/byj1
 v68jXyt2fqCAyxwkzRDv4GxhN8c6W+TPJpzivrEAuSkLacovESKztinOrafrBnLR
 zdyO4n0V0yGOXbAcxRbADVA4HUsqhLl4JRnvE5P5zIaD7rkE0UqggF7vrSeCvVA1
 hsi1BhkAMNimKX7CZMnT3dJpxRQj1eDJxpwUAusLHWjMyQbNFhV7WAtthMtVJon8
 lAS4e5+xzjqKhF15GpVg5Lzy8SAwdqgNXwwq2zbM8OaPKG0FpajG8DXAqqcj4fpy
 WXnwg72KZDmRcSNJhVZK6B9xSAwIMXPgH4ClCMP9/xlw8EDpM38MDmzrs35TAVtI
 HGE3Qv9CjFjVj/OG3el+bTGIQJFVgYEVPV5TYfNCpKoxpj5cLn5OQY5u6MJawtgK
 HeDgKv3jw3lHatDALMVfwJqqVlUht0R6SIxtP9WHV+CcFrqN1LJKmdhDQbm7b4XK
 EbbawIsdxw==
 =Ci5m
 -----END PGP SIGNATURE-----

Merge tag 'for-6.15/block-20250322' of git://git.kernel.dk/linux

Pull block updates from Jens Axboe:

 - Fixes for integrity handling

 - NVMe pull request via Keith:
      - Secure concatenation for TCP transport (Hannes)
      - Multipath sysfs visibility (Nilay)
      - Various cleanups (Qasim, Baruch, Wang, Chen, Mike, Damien, Li)
      - Correct use of 64-bit BARs for pci-epf target (Niklas)
      - Socket fix for selinux when used in containers (Peijie)

 - MD pull request via Yu:
      - fix recovery can preempt resync (Li Nan)
      - fix md-bitmap IO limit (Su Yue)
      - fix raid10 discard with REQ_NOWAIT (Xiao Ni)
      - fix raid1 memory leak (Zheng Qixing)
      - fix mddev uaf (Yu Kuai)
      - fix raid1,raid10 IO flags (Yu Kuai)
      - some refactor and cleanup (Yu Kuai)

 - Series cleaning up and fixing bugs in the bad block handling code

 - Improve support for write failure simulation in null_blk

 - Various lock ordering fixes

 - Fixes for locking for debugfs attributes

 - Various ublk related fixes and improvements

 - Cleanups for blk-rq-qos wait handling

 - blk-throttle fixes

 - Fixes for loop dio and sync handling

 - Fixes and cleanups for the auto-PI code

 - Block side support for hardware encryption keys in blk-crypto

 - Various cleanups and fixes

* tag 'for-6.15/block-20250322' of git://git.kernel.dk/linux: (105 commits)
  nvmet: replace max(a, min(b, c)) by clamp(val, lo, hi)
  nvme-tcp: fix selinux denied when calling sock_sendmsg
  nvmet: pci-epf: Always configure BAR0 as 64-bit
  nvmet: Remove duplicate uuid_copy
  nvme: zns: Simplify nvme_zone_parse_entry()
  nvmet: pci-epf: Remove redundant 'flush_workqueue()' calls
  nvmet-fc: Remove unused functions
  nvme-pci: remove stale comment
  nvme-fc: Utilise min3() to simplify queue count calculation
  nvme-multipath: Add visibility for queue-depth io-policy
  nvme-multipath: Add visibility for numa io-policy
  nvme-multipath: Add visibility for round-robin io-policy
  nvmet: add tls_concat and tls_key debugfs entries
  nvmet-tcp: support secure channel concatenation
  nvmet: Add 'sq' argument to alloc_ctrl_args
  nvme-fabrics: reset admin connection for secure concatenation
  nvme-tcp: request secure channel concatenation
  nvme-keyring: add nvme_tls_psk_refresh()
  nvme: add nvme_auth_derive_tls_psk()
  nvme: add nvme_auth_generate_digest()
  ...
2025-03-26 18:08:55 -07:00
Linus Torvalds ee6740fd34 CRC updates for 6.15
Another set of improvements to the kernel's CRC (cyclic redundancy
 check) code:
 
 - Rework the CRC64 library functions to be directly optimized, like what
   I did last cycle for the CRC32 and CRC-T10DIF library functions.
 
 - Rewrite the x86 PCLMULQDQ-optimized CRC code, and add VPCLMULQDQ
   support and acceleration for crc64_be and crc64_nvme.
 
 - Rewrite the riscv Zbc-optimized CRC code, and add acceleration for
   crc_t10dif, crc64_be, and crc64_nvme.
 
 - Remove crc_t10dif and crc64_rocksoft from the crypto API, since they
   are no longer needed there.
 
 - Rename crc64_rocksoft to crc64_nvme, as the old name was incorrect.
 
 - Add kunit test cases for crc64_nvme and crc7.
 
 - Eliminate redundant functions for calculating the Castagnoli CRC32,
   settling on just crc32c().
 
 - Remove unnecessary prompts from some of the CRC kconfig options.
 
 - Further optimize the x86 crc32c code.
 -----BEGIN PGP SIGNATURE-----
 
 iIoEABYIADIWIQSacvsUNc7UX4ntmEPzXCl4vpKOKwUCZ+CGGhQcZWJpZ2dlcnNA
 Z29vZ2xlLmNvbQAKCRDzXCl4vpKOK3wRAP4tbnzawUmlIHIF0hleoADXehUgAhMt
 NZn15mGvyiuwIQEA8W9qvnLdFXZkdxhxAEvDDFjyrRauL6eGtr/GvCx4AQY=
 =wmKG
 -----END PGP SIGNATURE-----

Merge tag 'crc-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiggers/linux

Pull CRC updates from Eric Biggers:
 "Another set of improvements to the kernel's CRC (cyclic redundancy
  check) code:

   - Rework the CRC64 library functions to be directly optimized, like
     what I did last cycle for the CRC32 and CRC-T10DIF library
     functions

   - Rewrite the x86 PCLMULQDQ-optimized CRC code, and add VPCLMULQDQ
     support and acceleration for crc64_be and crc64_nvme

   - Rewrite the riscv Zbc-optimized CRC code, and add acceleration for
     crc_t10dif, crc64_be, and crc64_nvme

   - Remove crc_t10dif and crc64_rocksoft from the crypto API, since
     they are no longer needed there

   - Rename crc64_rocksoft to crc64_nvme, as the old name was incorrect

   - Add kunit test cases for crc64_nvme and crc7

   - Eliminate redundant functions for calculating the Castagnoli CRC32,
     settling on just crc32c()

   - Remove unnecessary prompts from some of the CRC kconfig options

   - Further optimize the x86 crc32c code"

* tag 'crc-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiggers/linux: (36 commits)
  x86/crc: drop the avx10_256 functions and rename avx10_512 to avx512
  lib/crc: remove unnecessary prompt for CONFIG_CRC64
  lib/crc: remove unnecessary prompt for CONFIG_LIBCRC32C
  lib/crc: remove unnecessary prompt for CONFIG_CRC8
  lib/crc: remove unnecessary prompt for CONFIG_CRC7
  lib/crc: remove unnecessary prompt for CONFIG_CRC4
  lib/crc7: unexport crc7_be_syndrome_table
  lib/crc_kunit.c: update comment in crc_benchmark()
  lib/crc_kunit.c: add test and benchmark for crc7_be()
  x86/crc32: optimize tail handling for crc32c short inputs
  riscv/crc64: add Zbc optimized CRC64 functions
  riscv/crc-t10dif: add Zbc optimized CRC-T10DIF function
  riscv/crc32: reimplement the CRC32 functions using new template
  riscv/crc: add "template" for Zbc optimized CRC functions
  x86/crc: add ANNOTATE_NOENDBR to suppress objtool warnings
  x86/crc32: improve crc32c_arch() code generation with clang
  x86/crc64: implement crc64_be and crc64_nvme using new template
  x86/crc-t10dif: implement crc_t10dif using new template
  x86/crc32: implement crc32_le using new template
  x86/crc: add "template" for [V]PCLMULQDQ based CRC functions
  ...
2025-03-25 18:33:04 -07:00
Christoph Hellwig d43929ef65 dm-delay: support zoned devices
Add support for zoned device by passing through report_zoned to the
underlying read device.

This is required to make enable xfstests xfs/311 on zoned devices.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
2025-03-24 18:28:32 +01:00
Linus Torvalds b35233e7bf - dm-flakey: fix memory corruption in optional corrupt_bio_byte feature
-----BEGIN PGP SIGNATURE-----
 
 iIoEABYIADIWIQRnH8MwLyZDhyYfesYTAyx9YGnhbQUCZ9QgCxQcbXBhdG9ja2FA
 cmVkaGF0LmNvbQAKCRATAyx9YGnhbQOXAP0dT6GGRIeW02cWgBeDAT4dmCOMyZhV
 Y4QYbBWnfpm97QEAmuEkUrw8apznS9kJ+n/O4NFbBa3gWJJ0K80HnPxX8wE=
 =HWcU
 -----END PGP SIGNATURE-----

Merge tag 'for-6.14/dm-fixes-2' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm

Pull device mapper fix from Mikulas Patocka:

 - dm-flakey: fix memory corruption in optional corrupt_bio_byte feature

* tag 'for-6.14/dm-fixes-2' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm:
  dm-flakey: Fix memory corruption in optional corrupt_bio_byte feature
2025-03-14 11:31:57 -10:00
Mikulas Patocka 45fc728515 dm: restrict dm device size to 2^63-512 bytes
The devices with size >= 2^63 bytes can't be used reliably by userspace
because the type off_t is a signed 64-bit integer.

Therefore, we limit the maximum size of a device mapper device to
2^63-512 bytes.

Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
2025-03-14 13:58:33 +01:00
Kent Overstreet 57e9417f69 dm-flakey: Fix memory corruption in optional corrupt_bio_byte feature
Fix memory corruption due to incorrect parameter being passed to bio_init

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Cc: stable@vger.kernel.org	# v6.5+
Fixes: 1d9a943898 ("dm flakey: clone pages on write bio before corrupting them")
2025-03-13 18:54:11 +01:00
Jens Axboe 017ff379b6 Merge tag 'md-6.15-20250312' of https://git.kernel.org/pub/scm/linux/kernel/git/mdraid/linux into for-6.15/block
Merge MD changes from Yu:

"- fix recovery can preempt resync (Li Nan)
 - fix md-bitmap IO limit (Su Yue)
 - fix raid10 discard with REQ_NOWAIT (Xiao Ni)
 - fix raid1 memory leak (Zheng Qixing)
 - fix mddev uaf (Yu Kuai)
 - fix raid1,raid10 IO flags (Yu Kuai)
 - some refactor and cleanup (Yu Kuai)"

* tag 'md-6.15-20250312' of https://git.kernel.org/pub/scm/linux/kernel/git/mdraid/linux:
  md/raid10: wait barrier before returning discard request with REQ_NOWAIT
  md/md-bitmap: fix wrong bitmap_limit for clustermd when write sb
  md/raid1,raid10: don't ignore IO flags
  md/raid5: merge reshape_progress checking inside get_reshape_loc()
  md: fix mddev uaf while iterating all_mddevs list
  md: switch md-cluster to use md_submodle_head
  md: don't export md_cluster_ops
  md/md-cluster: cleanup md_cluster_ops reference
  md: switch personalities to use md_submodule_head
  md: introduce struct md_submodule_head and APIs
  md: only include md-cluster.h if necessary
  md: merge common code into find_pers()
  md/raid1: fix memory leak in raid1_run() if no active rdev
  md: ensure resync is prioritized over recovery
2025-03-13 05:34:51 -06:00
Ming-Hung Tsai c2662b1544 dm cache: support shrinking the origin device
This patch introduces formal support for shrinking the cache origin by
reducing the cache target length via table reloads. Cache blocks mapped
beyond the new target length must be clean and are invalidated during
preresume. If any dirty blocks exist in the area being removed, the
preresume operation fails without setting the NEEDS_CHECK flag in
superblock, and the resume ioctl returns EFBIG. The cache device remains
suspended until a table reload with target length that fits existing
mappings is performed.

Without this patch, reducing the cache target length could result in
io errors (RHBZ: 2134334), out-of-bounds memory access to the discard
bitset, and security concerns regarding data leakage.

Verification steps:

1. create a cache metadata with some cached blocks mapped to the tail
   of the origin device. Here we use cache_restore v1.0 to build a
   metadata with one clean block mapped to the last origin block.

cat <<EOF >> cmeta.xml
<superblock uuid="" block_size="128" nr_cache_blocks="512" \
policy="smq" hint_width="4">
  <mappings>
    <mapping cache_block="0" origin_block="4095" dirty="false"/>
  </mappings>
</superblock>
EOF
dmsetup create cmeta --table "0 8192 linear /dev/sdc 0"
cache_restore -i cmeta.xml -o /dev/mapper/cmeta --metadata-version=2
dmsetup remove cmeta

2. bring up the cache whilst shrinking the cache origin by one block:

dmsetup create cmeta --table "0 8192 linear /dev/sdc 0"
dmsetup create cdata --table "0 65536 linear /dev/sdc 8192"
dmsetup create corig --table "0 524160 linear /dev/sdc 262144"
dmsetup create cache --table "0 524160 cache /dev/mapper/cmeta \
/dev/mapper/cdata /dev/mapper/corig 128 2 metadata2 writethrough smq 0"

3. check the number of cached data blocks via dmsetup status. It is
   expected to be zero.

dmsetup status cache | cut -d ' ' -f 7

In addition to the script above, this patch can be verified using the
"cache/resize" tests in dmtest-python:

./dmtest run --rx cache/resize/shrink_origin --result-set default

Signed-off-by: Ming-Hung Tsai <mtsai@redhat.com>
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
2025-03-06 16:13:47 +01:00
Ming-Hung Tsai 5da692e226 dm cache: prevent BUG_ON by blocking retries on failed device resumes
A cache device failing to resume due to mapping errors should not be
retried, as the failure leaves a partially initialized policy object.
Repeating the resume operation risks triggering BUG_ON when reloading
cache mappings into the incomplete policy object.

Reproduce steps:

1. create a cache metadata consisting of 512 or more cache blocks,
   with some mappings stored in the first array block of the mapping
   array. Here we use cache_restore v1.0 to build the metadata.

cat <<EOF >> cmeta.xml
<superblock uuid="" block_size="128" nr_cache_blocks="512" \
policy="smq" hint_width="4">
  <mappings>
    <mapping cache_block="0" origin_block="0" dirty="false"/>
  </mappings>
</superblock>
EOF
dmsetup create cmeta --table "0 8192 linear /dev/sdc 0"
cache_restore -i cmeta.xml -o /dev/mapper/cmeta --metadata-version=2
dmsetup remove cmeta

2. wipe the second array block of the mapping array to simulate
   data degradations.

mapping_root=$(dd if=/dev/sdc bs=1c count=8 skip=192 \
2>/dev/null | hexdump -e '1/8 "%u\n"')
ablock=$(dd if=/dev/sdc bs=1c count=8 skip=$((4096*mapping_root+2056)) \
2>/dev/null | hexdump -e '1/8 "%u\n"')
dd if=/dev/zero of=/dev/sdc bs=4k count=1 seek=$ablock

3. try bringing up the cache device. The resume is expected to fail
   due to the broken array block.

dmsetup create cmeta --table "0 8192 linear /dev/sdc 0"
dmsetup create cdata --table "0 65536 linear /dev/sdc 8192"
dmsetup create corig --table "0 524288 linear /dev/sdc 262144"
dmsetup create cache --notable
dmsetup load cache --table "0 524288 cache /dev/mapper/cmeta \
/dev/mapper/cdata /dev/mapper/corig 128 2 metadata2 writethrough smq 0"
dmsetup resume cache

4. try resuming the cache again. An unexpected BUG_ON is triggered
   while loading cache mappings.

dmsetup resume cache

Kernel logs:

(snip)
------------[ cut here ]------------
kernel BUG at drivers/md/dm-cache-policy-smq.c:752!
Oops: invalid opcode: 0000 [#1] PREEMPT SMP KASAN NOPTI
CPU: 0 UID: 0 PID: 332 Comm: dmsetup Not tainted 6.13.4 #3
RIP: 0010:smq_load_mapping+0x3e5/0x570

Fix by disallowing resume operations for devices that failed the
initial attempt.

Signed-off-by: Ming-Hung Tsai <mtsai@redhat.com>
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
2025-03-06 16:13:30 +01:00
Zheng Qixing d301f164c3 badblocks: use sector_t instead of int to avoid truncation of badblocks length
There is a truncation of badblocks length issue when set badblocks as
follow:

echo "2055 4294967299" > bad_blocks
cat bad_blocks
2055 3

Change 'sectors' argument type from 'int' to 'sector_t'.

This change avoids truncation of badblocks length for large sectors by
replacing 'int' with 'sector_t' (u64), enabling proper handling of larger
disk sizes and ensuring compatibility with 64-bit sector addressing.

Fixes: 9e0e252a04 ("badblocks: Add core badblock management code")
Signed-off-by: Zheng Qixing <zhengqixing@huawei.com>
Reviewed-by: Yu Kuai <yukuai3@huawei.com>
Acked-by: Coly Li <colyli@kernel.org>
Link: https://lore.kernel.org/r/20250227075507.151331-13-zhengqixing@huaweicloud.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-03-06 08:04:52 -07:00
Zheng Qixing 7e5102dd99 md: improve return types of badblocks handling functions
rdev_set_badblocks() only indicates success/failure, so convert its return
type from int to boolean for better semantic clarity.

rdev_clear_badblocks() return value is never used by any caller, convert it
to void. This removes unnecessary value returns.

Also update narrow_write_error() in both raid1 and raid10 to use boolean
return type to match rdev_set_badblocks().

Signed-off-by: Zheng Qixing <zhengqixing@huawei.com>
Reviewed-by: Yu Kuai <yukuai3@huawei.com>
Link: https://lore.kernel.org/r/20250227075507.151331-12-zhengqixing@huaweicloud.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-03-06 08:03:28 -07:00
Zheng Qixing c8775aefba badblocks: return boolean from badblocks_set() and badblocks_clear()
Change the return type of badblocks_set() and badblocks_clear()
from int to bool, indicating success or failure. Specifically:

- _badblocks_set() and _badblocks_clear() functions now return
true for success and false for failure.
- All calls to these functions are updated to handle the new
boolean return type.
- This change improves code clarity and ensures a more consistent
handling of success and failure states.

Signed-off-by: Zheng Qixing <zhengqixing@huawei.com>
Reviewed-by: Yu Kuai <yukuai3@huawei.com>
Acked-by: Coly Li <colyli@kernel.org>
Acked-by: Ira Weiny <ira.weiny@intel.com>
Link: https://lore.kernel.org/r/20250227075507.151331-11-zhengqixing@huaweicloud.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-03-06 08:03:28 -07:00
Xiao Ni 3db4404435 md/raid10: wait barrier before returning discard request with REQ_NOWAIT
raid10_handle_discard should wait barrier before returning a discard bio
which has REQ_NOWAIT. And there is no need to print warning calltrace
if a discard bio has REQ_NOWAIT flag. Quality engineer usually checks
dmesg and reports error if dmesg has warning/error calltrace.

Fixes: c9aa889b03 ("md: raid10 add nowait support")
Signed-off-by: Xiao Ni <xni@redhat.com>
Acked-by: Coly Li <colyli@kernel.org>
Link: https://lore.kernel.org/linux-raid/20250306094938.48952-1-xni@redhat.com
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
2025-03-06 22:34:20 +08:00
Su Yue 6130825f34 md/md-bitmap: fix wrong bitmap_limit for clustermd when write sb
In clustermd, separate write-intent-bitmaps are used for each cluster
node:

0                    4k                     8k                    12k
-------------------------------------------------------------------
| idle                | md super            | bm super [0] + bits |
| bm bits[0, contd]   | bm super[1] + bits  | bm bits[1, contd]   |
| bm super[2] + bits  | bm bits [2, contd]  | bm super[3] + bits  |
| bm bits [3, contd]  |                     |                     |

So in node 1, pg_index in __write_sb_page() could equal to
bitmap->storage.file_pages. Then bitmap_limit will be calculated to
0. md_super_write() will be called with 0 size.
That means the first 4k sb area of node 1 will never be updated
through filemap_write_page().
This bug causes hang of mdadm/clustermd_tests/01r1_Grow_resize.

Here use (pg_index % bitmap->storage.file_pages) to make calculation
of bitmap_limit correct.

Fixes: ab99a87542 ("md/md-bitmap: fix writing non bitmap pages")
Signed-off-by: Su Yue <glass.su@suse.com>
Reviewed-by: Heming Zhao <heming.zhao@suse.com>
Link: https://lore.kernel.org/linux-raid/20250303033918.32136-1-glass.su@suse.com
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
2025-03-05 00:34:00 +08:00
Yu Kuai e879a0d9cb md/raid1,raid10: don't ignore IO flags
If blk-wbt is enabled by default, it's found that raid write performance
is quite bad because all IO are throttled by wbt of underlying disks,
due to flag REQ_IDLE is ignored. And turns out this behaviour exist since
blk-wbt is introduced.

Other than REQ_IDLE, other flags should not be ignored as well, for
example REQ_META can be set for filesystems, clearing it can cause priority
reverse problems; And REQ_NOWAIT should not be cleared as well, because
io will wait instead of failing directly in underlying disks.

Fix those problems by keep IO flags from master bio.

Fises: f51d46d0e7 ("md: add support for REQ_NOWAIT")
Fixes: e34cbd3074 ("blk-wbt: add general throttling mechanism")
Fixes: 5404bc7a87 ("[PATCH] Allow file systems to differentiate between data and meta reads")
Link: https://lore.kernel.org/linux-raid/20250227121657.832356-1-yukuai1@huaweicloud.com
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
2025-03-05 00:32:22 +08:00
Yu Kuai 1320fe8741 md/raid5: merge reshape_progress checking inside get_reshape_loc()
During code review, it's found that other than raid5_bitmap_sector(),
reshape_progress is always checked before get_reshape_loc(), while
raid5_bitmap_sector() should check as well to prevent holding the
lock 'conf->device_lock'. Hence merge that checking inside
get_reshape_loc().

Link: https://lore.kernel.org/linux-raid/20250227120452.808503-1-yukuai1@huaweicloud.com
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
2025-03-05 00:31:27 +08:00
Yu Kuai 8542870237 md: fix mddev uaf while iterating all_mddevs list
While iterating all_mddevs list from md_notify_reboot() and md_exit(),
list_for_each_entry_safe is used, and this can race with deletint the
next mddev, causing UAF:

t1:
spin_lock
//list_for_each_entry_safe(mddev, n, ...)
 mddev_get(mddev1)
 // assume mddev2 is the next entry
 spin_unlock
            t2:
            //remove mddev2
            ...
            mddev_free
            spin_lock
            list_del
            spin_unlock
            kfree(mddev2)
 mddev_put(mddev1)
 spin_lock
 //continue dereference mddev2->all_mddevs

The old helper for_each_mddev() actually grab the reference of mddev2
while holding the lock, to prevent from being freed. This problem can be
fixed the same way, however, the code will be complex.

Hence switch to use list_for_each_entry, in this case mddev_put() can free
the mddev1 and it's not safe as well. Refer to md_seq_show(), also factor
out a helper mddev_put_locked() to fix this problem.

Cc: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/linux-raid/20250220124348.845222-1-yukuai1@huaweicloud.com
Fixes: f265143422 ("md: stop using for_each_mddev in md_notify_reboot")
Fixes: 16648bac86 ("md: stop using for_each_mddev in md_exit")
Reported-and-tested-by: Guillaume Morin <guillaume@morinfr.org>
Closes: https://lore.kernel.org/all/Z7Y0SURoA8xwg7vn@bender.morinfr.org/
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
2025-03-05 00:29:41 +08:00
Yu Kuai 87a86277c9 md: switch md-cluster to use md_submodle_head
To make code cleaner, and prepare to add kconfig for bitmap.

Also remove the unsed global variables pers_lock, md_cluster_ops and
md_cluster_mod, and exported symbols register_md_cluster_operations(),
unregister_md_cluster_operations() and md_cluster_ops.

Link: https://lore.kernel.org/linux-raid/20250215092225.2427977-8-yukuai1@huaweicloud.com
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Reviewed-by: Su Yue <glass.su@suse.com>
2025-03-05 00:28:39 +08:00
Yu Kuai c594de0455 md: don't export md_cluster_ops
Add a new field 'cluster_ops' and initialize it md_setup_cluster(), so
that the gloable variable 'md_cluter_ops' doesn't need to be exported.
Also prepare to switch md-cluster to use md_submod_head.

Link: https://lore.kernel.org/linux-raid/20250215092225.2427977-7-yukuai1@huaweicloud.com
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Reviewed-by: Su Yue <glass.su@suse.com>
2025-03-05 00:28:17 +08:00
Yu Kuai ff84e1b1d2 md/md-cluster: cleanup md_cluster_ops reference
md_cluster_ops->slot_number() is implemented inside md-cluster.c, just
call it directly.

Link: https://lore.kernel.org/linux-raid/20250215092225.2427977-6-yukuai1@huaweicloud.com
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Reviewed-by: Su Yue <glass.su@suse.com>
2025-03-05 00:27:47 +08:00
Yu Kuai 3d44e1d157 md: switch personalities to use md_submodule_head
Remove the global list 'pers_list', and switch to use md_submodule_head,
which is managed by xarry. Prepare to unify registration and unregistration
for all sub modules.

Link: https://lore.kernel.org/linux-raid/20250215092225.2427977-5-yukuai1@huaweicloud.com
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
2025-03-05 00:27:20 +08:00
Yu Kuai d3beb7c9c6 md: introduce struct md_submodule_head and APIs
Prepare to unify registration and unregistration of md personalities
and md-cluster, also prepare for add kconfig for md-bitmap.

Link: https://lore.kernel.org/linux-raid/20250215092225.2427977-4-yukuai1@huaweicloud.com
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
2025-03-05 00:26:56 +08:00
Yu Kuai bf0a73264f md: only include md-cluster.h if necessary
md-cluster is only supportted by raid1 and raid10, there is no need to
include md-cluster.h for other personalities.

Also move APIs that is only used in md-cluster.c from md.h to
md-cluster.h.

Link: https://lore.kernel.org/linux-raid/20250215092225.2427977-3-yukuai1@huaweicloud.com
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Reviewed-by: Su Yue <glass.su@suse.com>
2025-03-05 00:26:21 +08:00
Yu Kuai 9faab54897 md: merge common code into find_pers()
- pers_lock() are held and released from caller
- try_module_get() is called from caller
- error message from caller

Merge above code into find_pers(), and rename it to get_pers(), also
add a wrapper to module_put() as put_pers().

Link: https://lore.kernel.org/linux-raid/20250215092225.2427977-2-yukuai1@huaweicloud.com
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Reviewed-by: Su Yue <glass.su@suse.com>
2025-03-05 00:26:08 +08:00
Christoph Hellwig 105ca2a2c2 block: split struct bio_integrity_payload
Many of the fields in struct bio_integrity_payload are only needed for
the default integrity buffer in the block layer, and the variable
sized array at the end of the structure makes it very hard to embed
into caller allocated structures.

Reduce struct bio_integrity_payload to the minimal structure needed in
common code and create two separate containing structures for the
automatically generated payload and the caller allocated payload.
The latter is a simple wrapper for struct bio_integrity_payload and
the bvecs, while the former contains the additional fields moved out
of struct bio_integrity_payload.

Always use a dedicated mempool for automatic integrity metadata
instead of depending on bio_set that is submitter controlled and thus
often doesn't have the mempool initialized and stop using mempools for
the submitter buffers as they aren't in the NOIO I/O submission path
where we need to guarantee forward progress.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Tested-by: Anuj Gupta <anuj20.g@samsung.com>
Reviewed-by: Anuj Gupta <anuj20.g@samsung.com>
Reviewed-by: Kanchan Joshi <joshi.k@samsung.com>
Link: https://lore.kernel.org/r/20250225154449.422989-4-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-03-03 11:17:52 -07:00
Ken Raeburn e678900df2 dm vdo indexer: reorder uds_request to reduce padding
Reorder fields and make uds_request_type and uds_zone_message packed,
to squeeze out some space. Use struct_group so the request reset code
no longer needs to care about field order.

On x86_64 this reduces the struct size from 144 to 120, which saves 48
kB (about 12%) per VDO hash zone.

Signed-off-by: Ken Raeburn <raeburn@redhat.com>
Signed-off-by: Matthew Sakai <msakai@redhat.com>
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
2025-02-25 14:15:41 +01:00
Linus Torvalds 2c24478e5f - dm-vdo: add missing spin_lock_init
- dm-integrity: divide-by-zero fix
 
 - dm-integrity: do not report unused entries in the table line
 -----BEGIN PGP SIGNATURE-----
 
 iIoEABYIADIWIQRnH8MwLyZDhyYfesYTAyx9YGnhbQUCZ7xnCRQcbXBhdG9ja2FA
 cmVkaGF0LmNvbQAKCRATAyx9YGnhbVueAP47A+AczQa868MpFc2p8w5AmqqF6jyW
 s303VCS9wUCtBwD/TkFcYz5AtIRvIIdZiwtOKALPekjeLeOZaOSnrdJVkw0=
 =x00V
 -----END PGP SIGNATURE-----

Merge tag 'for-6.14/dm-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm

Pull device mapper fixes from Mikulas Patocka:

 - dm-vdo: add missing spin_lock_init

 - dm-integrity: divide-by-zero fix

 - dm-integrity: do not report unused entries in the table line

* tag 'for-6.14/dm-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm:
  dm vdo: add missing spin_lock_init
  dm-integrity: Do not emit journal configuration in DM table for Inline mode
  dm-integrity: Avoid divide by zero in table status in Inline mode
2025-02-24 16:29:48 -08:00
Jinliang Zheng 88f7f56d16 dm: fix unconditional IO throttle caused by REQ_PREFLUSH
When a bio with REQ_PREFLUSH is submitted to dm, __send_empty_flush()
generates a flush_bio with REQ_OP_WRITE | REQ_PREFLUSH | REQ_SYNC,
which causes the flush_bio to be throttled by wbt_wait().

An example from v5.4, similar problem also exists in upstream:

    crash> bt 2091206
    PID: 2091206  TASK: ffff2050df92a300  CPU: 109  COMMAND: "kworker/u260:0"
     #0 [ffff800084a2f7f0] __switch_to at ffff80004008aeb8
     #1 [ffff800084a2f820] __schedule at ffff800040bfa0c4
     #2 [ffff800084a2f880] schedule at ffff800040bfa4b4
     #3 [ffff800084a2f8a0] io_schedule at ffff800040bfa9c4
     #4 [ffff800084a2f8c0] rq_qos_wait at ffff8000405925bc
     #5 [ffff800084a2f940] wbt_wait at ffff8000405bb3a0
     #6 [ffff800084a2f9a0] __rq_qos_throttle at ffff800040592254
     #7 [ffff800084a2f9c0] blk_mq_make_request at ffff80004057cf38
     #8 [ffff800084a2fa60] generic_make_request at ffff800040570138
     #9 [ffff800084a2fae0] submit_bio at ffff8000405703b4
    #10 [ffff800084a2fb50] xlog_write_iclog at ffff800001280834 [xfs]
    #11 [ffff800084a2fbb0] xlog_sync at ffff800001280c3c [xfs]
    #12 [ffff800084a2fbf0] xlog_state_release_iclog at ffff800001280df4 [xfs]
    #13 [ffff800084a2fc10] xlog_write at ffff80000128203c [xfs]
    #14 [ffff800084a2fcd0] xlog_cil_push at ffff8000012846dc [xfs]
    #15 [ffff800084a2fda0] xlog_cil_push_work at ffff800001284a2c [xfs]
    #16 [ffff800084a2fdb0] process_one_work at ffff800040111d08
    #17 [ffff800084a2fe00] worker_thread at ffff8000401121cc
    #18 [ffff800084a2fe70] kthread at ffff800040118de4

After commit 2def2845cc ("xfs: don't allow log IO to be throttled"),
the metadata submitted by xlog_write_iclog() should not be throttled.
But due to the existence of the dm layer, throttling flush_bio indirectly
causes the metadata bio to be throttled.

Fix this by conditionally adding REQ_IDLE to flush_bio.bi_opf, which makes
wbt_should_throttle() return false to avoid wbt_wait().

Signed-off-by: Jinliang Zheng <alexjlzheng@tencent.com>
Reviewed-by: Tianxiang Peng <txpeng@tencent.com>
Reviewed-by: Hao Peng <flyingpeng@tencent.com>
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
2025-02-24 12:09:44 +01:00
Ken Raeburn dc8f646cd8 dm vdo: rework processing of loaded refcount byte arrays
Clear provisional refcount values and count free/allocated blocks in
one integrated loop. Process 8 aligned bytes at a time instead of
every byte individually.

On an Intel i7-11850H this reduces the CPU time needed to process a
loaded refcount block by a factor of about 5-6. On a large system the
refcount loading may be the largest factor in device startup time.

Signed-off-by: Ken Raeburn <raeburn@redhat.com>
Signed-off-by: Matthew Sakai <msakai@redhat.com>
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
2025-02-24 12:09:44 +01:00
Sweet Tea Dorminy ff3f7115f4 dm vdo: remove remaining ring references
Lists are the new rings, so update all remaining references to rings to
talk about lists.

Signed-off-by: Sweet Tea Dorminy <sweettea-kernel@dorminy.me>
Signed-off-by: Matthew Sakai <msakai@redhat.com>
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
2025-02-24 12:09:44 +01:00
Mikulas Patocka 51ba14fb36 dm-verity: do forward error correction on metadata I/O errors
Do forward error correction if metadata I/O fails.

Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Reviewed-by: Sami Tolvanen <samitolvanen@google.com>
2025-02-24 12:09:41 +01:00
Ken Raeburn 36e1b81f59 dm vdo: add missing spin_lock_init
Signed-off-by: Ken Raeburn <raeburn@redhat.com>
Signed-off-by: Matthew Sakai <msakai@redhat.com>
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Cc: stable@vger.kernel.org
2025-02-24 12:07:02 +01:00
Mikulas Patocka 3be1f25393 dm-bufio: remove unused return value
The return value of the function "forget_buffer" is not tested, so we can
remove it.

Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
2025-02-24 11:42:23 +01:00
Mikulas Patocka 00204ae3d6 dm-integrity: set ti->error on memory allocation failure
The dm-integrity target didn't set the error string when memory
allocation failed. This patch fixes it.

Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Cc: stable@vger.kernel.org
2025-02-24 11:42:09 +01:00
Linus Torvalds 8a61cb6e15 block-6.14-20250221
-----BEGIN PGP SIGNATURE-----
 
 iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAme4rUUQHGF4Ym9lQGtl
 cm5lbC5kawAKCRD301j7KXHgpj9+EADLPOFPa9hT1PGBbpnj74vBayoTO/M+w2Gp
 +k2b8if3eGlY43WO2k+ytceWbA901iyvLPRqt1M1Ez8+BrNBg4NKcLv7q4O9NA3i
 nDPLggugSc5sdRbLRimxiwHkkpSOBenkdb7R9XGmMXTCfSbRKl0kK01ivpgkbiG4
 pbyPWYcoMyHaECBfPhazrJig4+rugXOYbkYoOM4NHsLqlTNfmowcMRPu+6czXt7q
 ITHW2RTWK3ue8q+c3nwGPDk2ZKM8X/49rA/6bvD3voLNs+jQ8KFg2KULENf0Xaq6
 1ZGrhLcr45iEHP0/+RORMzx27PqbTCSGIOTMZtwZNqh5+V+ybrGJq/F/T5rkrA3F
 QqHld/WSSKWJ10RVAyjDP7NQ5vNZTwwGAEVagjyIFEfk7G7RTY2kIpSZiUgrZ9oD
 4CkOKUGmVkUsKQW6gb0JQObtYyXXoNtmg8wQU2WwhISjFDkoYWw53LHwH/LnxLyi
 Vg182amVBmERk4I5nTUiIML/7TzS69srb0Q7yaQS3eTwzLorDaB+3tPAxQmCTkGq
 KeBfuBtbP3LTOy2Oek4YbKl8CA2KDYtK7FbCE6PECUbdTjpNrcAAgA/ZcgoKV8s7
 EHWZFx7dZyFS6LFNWzT9VhTtgSZS92JIsZwgnjSJPV2UazyxmqHFChzVVGWJbaB3
 agkMor3nVg==
 =L+Lr
 -----END PGP SIGNATURE-----

Merge tag 'block-6.14-20250221' of git://git.kernel.dk/linux

Pull block fixes from Jens Axboe:

 - NVMe pull request via Keith:
      - FC controller state check fixes (Daniel)
      - PCI Endpoint fixes (Damien)
      - TCP connection failure fixe (Caleb)
      - TCP handling C2HTermReq PDU (Maurizio)
      - RDMA queue state check (Ruozhu)
      - Apple controller fixes (Hector)
      - Target crash on disbaled namespace (Hannes)

 - MD pull request via Yu:
      - Fix queue limits error handling for raid0, raid1 and raid10

 - Fix for a NULL pointer deref in request data mapping

 - Code cleanup for request merging

* tag 'block-6.14-20250221' of git://git.kernel.dk/linux:
  nvme: only allow entering LIVE from CONNECTING state
  nvme-fc: rely on state transitions to handle connectivity loss
  apple-nvme: Support coprocessors left idle
  apple-nvme: Release power domains when probe fails
  nvmet: Use enum definitions instead of hardcoded values
  nvme: Cleanup the definition of the controller config register fields
  nvme/ioctl: add missing space in err message
  nvme-tcp: fix connect failure on receiving partial ICResp PDU
  nvme: tcp: Fix compilation warning with W=1
  nvmet: pci-epf: Avoid RCU stalls under heavy workload
  nvmet: pci-epf: Do not uselessly write the CSTS register
  nvmet: pci-epf: Correctly initialize CSTS when enabling the controller
  nvmet-rdma: recheck queue state is LIVE in state lock in recv done
  nvmet: Fix crash when a namespace is disabled
  nvme-tcp: add basic support for the C2HTermReq PDU
  nvme-pci: quirk Acer FA100 for non-uniqueue identifiers
  block: fix NULL pointer dereferenced within __blk_rq_map_sg
  block/merge: remove unnecessary min() with UINT_MAX
  md/raid*: Fix the set_queue_limits implementations
2025-02-21 09:36:28 -08:00
Ed Tsai a8b8a126c8 dm: Enable inline crypto passthrough for striped target
Added DM_TARGET_PASSES_CRYPTO feature to the striped target to utilize
the hardware encryption of the underlying storage devices, preventing
fallback to the crypto API.

Signed-off-by: Ed Tsai <ed.tsai@mediatek.com>
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
2025-02-17 11:48:32 +01:00
Milan Broz c19525b5fb dm-integrity: Do not emit journal configuration in DM table for Inline mode
The Inline mode does not use a journal; it makes no sense to print
journal information in DM table. Print it only if the journal is used.

The same applies to interleave_sectors (unused for Inline mode).

Also, add comments for arg_count, as the current calculation
is quite obscure.

Signed-off-by: Milan Broz <gmazyland@gmail.com>
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
2025-02-17 11:33:41 +01:00
Milan Broz 7fb39882b2 dm-integrity: Avoid divide by zero in table status in Inline mode
In Inline mode, the journal is unused, and journal_sectors is zero.

Calculating the journal watermark requires dividing by journal_sectors,
which should be done only if the journal is configured.

Otherwise, a simple table query (dmsetup table) can cause OOPS.

This bug did not show on some systems, perhaps only due to
compiler optimization.

On my 32-bit testing machine, this reliably crashes with the following:

 : Oops: divide error: 0000 [#1] PREEMPT SMP
 : CPU: 0 UID: 0 PID: 2450 Comm: dmsetup Not tainted 6.14.0-rc2+ #959
 : EIP: dm_integrity_status+0x2f8/0xab0 [dm_integrity]
 ...

Signed-off-by: Milan Broz <gmazyland@gmail.com>
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Fixes: fb0987682c ("dm-integrity: introduce the Inline mode")
Cc: stable@vger.kernel.org # 6.11+
2025-02-17 11:33:07 +01:00
Zheng Qixing 5fbcf76e0d md/raid1: fix memory leak in raid1_run() if no active rdev
When `raid1_set_limits()` fails or when the array has no active
`rdev`, the allocated memory for `conf` is not properly freed.

Add raid1_free() call to properly free the conf in error path.

Fixes: 799af947ed ("md/raid1: don't free conf on raid0_run failure")
Signed-off-by: Zheng Qixing <zhengqixing@huawei.com>
Link: https://lore.kernel.org/linux-raid/20250215020137.3703757-1-zhengqixing@huaweicloud.com
Singed-off-by: Yu Kuai <yukuai3@huawei.com>
2025-02-15 19:45:58 +08:00
Li Nan 4b10a3bc67 md: ensure resync is prioritized over recovery
If a new disk is added during resync, the resync process is interrupted,
and recovery is triggered, causing the previous resync to be lost. In
reality, disk addition should not terminate resync, fix it.

Steps to reproduce the issue:
  mdadm -CR /dev/md0 -l1 -n3 -x1 /dev/sd[abcd]
  mdadm --fail /dev/md0 /dev/sdc

Fixes: 24dd469d72 ("[PATCH] md: allow a manual resync with md")
Signed-off-by: Li Nan <linan122@huawei.com>
Reviewed-by: Yu Kuai <yukuai3@huawei.com>
Link: https://lore.kernel.org/linux-raid/20250213131530.3698600-1-linan666@huaweicloud.com
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
2025-02-15 19:45:39 +08:00
Bart Van Assche fbe8f2fa97 md/raid*: Fix the set_queue_limits implementations
queue_limits_cancel_update() must only be called if
queue_limits_start_update() is called first. Remove the
queue_limits_cancel_update() calls from the raid*_set_limits() functions
because there is no corresponding queue_limits_start_update() call.

Cc: Christoph Hellwig <hch@lst.de>
Fixes: c6e56cf6b2 ("block: move integrity information into queue_limits")
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/linux-raid/20250212171108.3483150-1-bvanassche@acm.org/
Signed-off-by: Yu Kuai <yukuai@kernel.org>
2025-02-14 01:29:33 +08:00
Eric Biggers ebc4176551 blk-crypto: add basic hardware-wrapped key support
To prevent keys from being compromised if an attacker acquires read
access to kernel memory, some inline encryption hardware can accept keys
which are wrapped by a per-boot hardware-internal key.  This avoids
needing to keep the raw keys in kernel memory, without limiting the
number of keys that can be used.  Such hardware also supports deriving a
"software secret" for cryptographic tasks that can't be handled by
inline encryption; this is needed for fscrypt to work properly.

To support this hardware, allow struct blk_crypto_key to represent a
hardware-wrapped key as an alternative to a raw key, and make drivers
set flags in struct blk_crypto_profile to indicate which types of keys
they support.  Also add the ->derive_sw_secret() low-level operation,
which drivers supporting wrapped keys must implement.

For more information, see the detailed documentation which this patch
adds to Documentation/block/inline-encryption.rst.

Signed-off-by: Eric Biggers <ebiggers@google.com>
Tested-by: Bartosz Golaszewski <bartosz.golaszewski@linaro.org> # sm8650
Link: https://lore.kernel.org/r/20250204060041.409950-2-ebiggers@kernel.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-02-10 09:54:19 -07:00
Eric Biggers 8df3682904 lib/crc32: standardize on crc32c() name for Castagnoli CRC32
For historical reasons, the Castagnoli CRC32 is available under 3 names:
crc32c(), crc32c_le(), and __crc32c_le().  Most callers use crc32c().
The more verbose versions are not really warranted; there is no "_be"
version that the "_le" version needs to be differentiated from, and the
leading underscores are pointless.

Therefore, let's standardize on just crc32c().  Remove the other two
names, and update callers accordingly.

Specifically, the new crc32c() comes from what was previously
__crc32c_le(), so compared to the old crc32c() it now takes a size_t
length rather than unsigned int, and it's now in linux/crc32.h instead
of just linux/crc32c.h (which includes linux/crc32.h).

Later patches will also rename __crc32c_le_combine(), crc32c_le_base(),
and crc32c_le_arch().

Reviewed-by: Ard Biesheuvel <ardb@kernel.org>
Link: https://lore.kernel.org/r/20250208024911.14936-5-ebiggers@kernel.org
Signed-off-by: Eric Biggers <ebiggers@google.com>
2025-02-08 20:06:30 -08:00
Linus Torvalds a67d0a0513 block-6.14-20250207
-----BEGIN PGP SIGNATURE-----
 
 iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmemOHsQHGF4Ym9lQGtl
 cm5lbC5kawAKCRD301j7KXHgpmE5D/0fcxzxCS9PpFIjcBfEu1jAy8q5MAxE/u63
 PQkkeKHGDUP2/8Ely7Wg1H5tiNagkDj9DHLHJEQikDpPsTCq5bv59VKSCdbsKmjC
 tyGHIxdQoc57jS17/UPCALOBQAro+eIX5YLf1av01lNJb8s1bScxseU0ETGSYkUr
 gXMQjSQuo3lggjvcZRf3LpmNKyO8h+Eh/d/3tBsijTh5nmqSvdbF+CcD560KjLTv
 3UPnT45OXK2fDd+JMxcn5CK4+tMA5Fv5VAf12GlHDgBjUMk1Q2gYEqYoxI9Kjr2E
 8WC/TGsKcyVGBhjQMNsN+zWV2Sc483n4KzcI9lksEyqgA8JbFp77sQ+JS7hypWbX
 meI+UV49grY/FzrrvZ8A3dlK2A9ktRX0G9UpOQts/i+dFaPsErZ2fihOBlEPAMBp
 fxMnYIdJiicLT6tEJno7AfRWhpVvqIPzWxPBcAip1Rgad2vLtz567rvlcOrm/pJE
 WMdVhQAs1Lt2viyJPy2JkCqUAvBt9UwGJ2Jo4mlZxpRg6Ns+mfe0m1o5uivtiu3/
 o7lYHRq6wm+HITXLXZvBe/T6/HV/b2xsxIatt47AJPMLVMf2B0JGvMTT9Ed+8LJQ
 6HLrZHPPYPLOEV2tLTIkYSJXctfpK5fkgKr3OS0M/1q4gV/fB9g7XuUALquYxM9Q
 qxZbmtho9g==
 =nffI
 -----END PGP SIGNATURE-----

Merge tag 'block-6.14-20250207' of git://git.kernel.dk/linux

Pull block fixes from Jens Axboe:

 - MD pull request via Song:
      - fix an error handling path for md-linear

 - NVMe pull request via Keith:
      - Connection fixes for fibre channel transport (Daniel)
      - Endian fixes (Keith, Christoph)
      - Cleanup fix for host memory buffer (Francis)
      - Platform specific power quirks (Georg)
      - Target memory leak (Sagi)
      - Use appropriate controller state accessor (Daniel)

 - Fixup for a regression introduced last week, where sunvdc wasn't
   updated for an API change, causing compilation failures on sparc64.

* tag 'block-6.14-20250207' of git://git.kernel.dk/linux:
  drivers/block/sunvdc.c: update the correct AIP call
  md: Fix linear_set_limits()
  nvme-fc: use ctrl state getter
  nvme: make nvme_tls_attrs_group static
  nvmet: add a missing endianess conversion in nvmet_execute_admin_connect
  nvmet: the result field in nvmet_alloc_ctrl_args is little endian
  nvmet: fix a memory leak in controller identify
  nvme-fc: do not ignore connectivity loss during connecting
  nvme: handle connectivity loss in nvme_set_queue_count
  nvme-fc: go straight to connecting state when initializing
  nvme-pci: Add TUXEDO IBP Gen9 to Samsung sleep quirk
  nvme-pci: Add TUXEDO InfinityFlex to Samsung sleep quirk
  nvme-pci: remove redundant dma frees in hmb
  nvmet: fix rw control endian access
2025-02-07 11:00:33 -08:00
Ken Raeburn 0ce46f4f75 dm vdo slab-depot: read refcount blocks in large chunks at load time
At startup, vdo loads all the reference count data before the device
reports that it is ready. Using a pool of large metadata vios can
improve the startup speed of vdo. The pool of large vios is released
after the device is ready.

During normal operation, reference counts are updated 4kB at a time,
as before.

Signed-off-by: Ken Raeburn <raeburn@redhat.com>
Signed-off-by: Matthew Sakai <msakai@redhat.com>
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
2025-02-03 14:15:30 +01:00
Ken Raeburn f979da5125 dm vdo vio-pool: allow variable-sized metadata vios
With larger-sized metadata vio pools, vdo will sometimes need to
issue I/O with a smaller size than the allocated size. Since
vio_reset_bio is where the bvec array and I/O size are initialized,
this reset interface must now specify what I/O size to use.

Signed-off-by: Ken Raeburn <raeburn@redhat.com>
Signed-off-by: Matthew Sakai <msakai@redhat.com>
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
2025-02-03 14:15:19 +01:00
Ken Raeburn 979a0fd396 dm vdo vio-pool: support pools with multiple data blocks per vio
Support pools with multiple data blocks per vio

Signed-off-by: Ken Raeburn <raeburn@redhat.com>
Signed-off-by: Matthew Sakai <msakai@redhat.com>
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
2025-02-03 14:15:08 +01:00
Ken Raeburn 2b515cea77 dm vdo vio-pool: add a pool pointer to pooled_vio
This allows us to simplify the return_vio_to_pool interface.

Also, we don't need to use vdo_forget on local variables or arguments
that are about to go out of scope anyway.

Signed-off-by: Ken Raeburn <raeburn@redhat.com>
Signed-off-by: Matthew Sakai <msakai@redhat.com>
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
2025-02-03 14:14:45 +01:00
Matthew Sakai 148a9cec84 dm vdo: remove checks that can not fail
Remove checks that can't fail.

Signed-off-by: Matthew Sakai <msakai@redhat.com>
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
2025-02-03 14:13:53 +01:00
Chung Chung f4e99b846c dm vdo indexer: prevent unterminated string warning
Fix array initialization that triggers a warning:

error: initializer-string for array of ‘unsigned char’ is too long
 [-Werror=unterminated-string-initialization]

Signed-off-by: Chung Chung <cchung@redhat.com>
Signed-off-by: Matthew Sakai <msakai@redhat.com>
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
2025-02-03 14:11:22 +01:00
Matthew Sakai 3280c9313c dm vdo: use a short static string for thread name prefix
Also remove MODULE_NAME and a BUG_ON check, both unneeded.

This fixes a warning about string truncation in snprintf that
will never happen in practice:

drivers/md/dm-vdo/vdo.c: In function ‘vdo_make’:
drivers/md/dm-vdo/vdo.c:564:5: error: ‘%s’ directive output may be truncated writing up to 55 bytes into a region of size 16 [-Werror=format-truncation=]
    "%s%u", MODULE_NAME, instance);
     ^~
drivers/md/dm-vdo/vdo.c:563:2: note: ‘snprintf’ output between 2 and 66 bytes into a destination of size 16
  snprintf(vdo->thread_name_prefix, sizeof(vdo->thread_name_prefix),
  ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
    "%s%u", MODULE_NAME, instance);
    ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Reported-by: John Garry <john.g.garry@oracle.com>
Reviewed-by: John Garry <john.g.garry@oracle.com>
Signed-off-by: Matthew Sakai <msakai@redhat.com>
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
2025-02-03 14:10:37 +01:00
Eric Biggers f656fa4013 dm-crypt: switch to using the crc32 library
Now that the crc32() library function takes advantage of
architecture-specific optimizations, it is unnecessary to go through the
crypto API.  Just use crc32().  This is much simpler, and it improves
performance due to eliminating the crypto API overhead.  (However, this
only affects the TCW IV mode of dm-crypt, which is a compatibility mode
that is rarely used compared to other dm-crypt modes.)

Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
2025-02-03 14:10:10 +01:00
Linus Torvalds 9755ffd989 block-6.14-20250131
-----BEGIN PGP SIGNATURE-----
 
 iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmec72kQHGF4Ym9lQGtl
 cm5lbC5kawAKCRD301j7KXHgpm0pD/sENbY3LKqN8JipH8fWjLmdcimB7STB8MpB
 LcuWiiuhDQuU1Yjr7c1h0Nkb6qwrb+gP+NQJrhlyUO7JVlvzFivJqsRscibNj+MF
 hX/vUM2zKo0La2YxRv7aygTTCs8JfRmLKkTtfo0H4geljbX8/yMKSEJA+7x4j60X
 TeJjmAHEKBGm2cNA7QMGqhzoUfjW1WQyn9x6/YOlLsYLYB8Cd7LFJqxM1tiDzUay
 ys/QJ833qctdqBK5iLIh5voqZA58N3koidw536NLNgyvJn77CSAmC5jF1Vr0kFS6
 Zb+dE9qRXDsND83dq3qvpSG5gtE42DygEG0g7s49zztd1rPOWE45RCxQtcZqHSvu
 N9hN+/4SNX+02AfrgZFEk8+yGbtIlsb0mHRxqIKh8261tb1RTYVmtI4tqC3GVMjj
 Qo41rDLDhaPOoKayo/KZO55n9U9M8BIkoF73hIIr5l9wHf5iOwi+EjzULlYhGigc
 xypMvwWFLu3Of9YCJo38fTxGO6fL4eOJihoqqxNfeRACyQqbVeLWAT4wgUr2G8la
 MIp2l5Vb7MmFXsES39fARLoN/8jjbgaXdxnSWhZlo4SPQaAlVA5ScXC4tKI3Gte3
 O7fPVAX76PrZ+y6x4enfooI3mrpkeK676BxsokOX0m8WrVyVmIc4cfSaayc1gC5m
 Uy298X93IA==
 =molg
 -----END PGP SIGNATURE-----

Merge tag 'block-6.14-20250131' of git://git.kernel.dk/linux

Pull more block updates from Jens Axboe:

 - MD pull request via Song:
      - Fix a md-cluster regression introduced

 - More sysfs race fixes

 - Mark anything inside queue freezing as not being able to do IO for
   memory allocations

 - Fix for a regression introduced in loop in this merge window

 - Fix for a regression in queue mapping setups introduced in this merge
   window

 - Fix for the block dio fops attempting an iov_iter revert upton
   getting -EIOCBQUEUED on the read side. This one is going to stable as
   well

* tag 'block-6.14-20250131' of git://git.kernel.dk/linux:
  block: force noio scope in blk_mq_freeze_queue
  block: fix nr_hw_queue update racing with disk addition/removal
  block: get rid of request queue ->sysfs_dir_lock
  loop: don't clear LO_FLAGS_PARTSCAN on LOOP_SET_STATUS{,64}
  md/md-bitmap: Synchronize bitmap_get_stats() with bitmap lifetime
  blk-mq: create correct map for fallback case
  block: don't revert iter for -EIOCBQUEUED
2025-01-31 11:49:30 -08:00
Bart Van Assche a572593ac8 md: Fix linear_set_limits()
queue_limits_cancel_update() must only be called if
queue_limits_start_update() is called first. Remove the
queue_limits_cancel_update() call from linear_set_limits() because
there is no corresponding queue_limits_start_update() call.

This bug was discovered by annotating all mutex operations with clang
thread-safety attributes and by building the kernel with clang and
-Wthread-safety.

Cc: Yu Kuai <yukuai3@huawei.com>
Cc: Coly Li <colyli@kernel.org>
Cc: Mike Snitzer <snitzer@kernel.org>
Cc: Christoph Hellwig <hch@lst.de>
Fixes: 127186cfb1 ("md: reintroduce md-linear")
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20250129225636.2667932-1-bvanassche@acm.org
Signed-off-by: Song Liu <song@kernel.org>
2025-01-31 10:18:50 -08:00
Joel Granados 1751f872cc treewide: const qualify ctl_tables where applicable
Add the const qualifier to all the ctl_tables in the tree except for
watchdog_hardlockup_sysctl, memory_allocation_profiling_sysctls,
loadpin_sysctl_table and the ones calling register_net_sysctl (./net,
drivers/inifiniband dirs). These are special cases as they use a
registration function with a non-const qualified ctl_table argument or
modify the arrays before passing them on to the registration function.

Constifying ctl_table structs will prevent the modification of
proc_handler function pointers as the arrays would reside in .rodata.
This is made possible after commit 78eb4ea25c ("sysctl: treewide:
constify the ctl_table argument of proc_handlers") constified all the
proc_handlers.

Created this by running an spatch followed by a sed command:
Spatch:
    virtual patch

    @
    depends on !(file in "net")
    disable optional_qualifier
    @

    identifier table_name != {
      watchdog_hardlockup_sysctl,
      iwcm_ctl_table,
      ucma_ctl_table,
      memory_allocation_profiling_sysctls,
      loadpin_sysctl_table
    };
    @@

    + const
    struct ctl_table table_name [] = { ... };

sed:
    sed --in-place \
      -e "s/struct ctl_table .table = &uts_kern/const struct ctl_table *table = \&uts_kern/" \
      kernel/utsname_sysctl.c

Reviewed-by: Song Liu <song@kernel.org>
Acked-by: Steven Rostedt (Google) <rostedt@goodmis.org> # for kernel/trace/
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> # SCSI
Reviewed-by: Darrick J. Wong <djwong@kernel.org> # xfs
Acked-by: Jani Nikula <jani.nikula@intel.com>
Acked-by: Corey Minyard <cminyard@mvista.com>
Acked-by: Wei Liu <wei.liu@kernel.org>
Acked-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Bill O'Donnell <bodonnel@redhat.com>
Acked-by: Baoquan He <bhe@redhat.com>
Acked-by: Ashutosh Dixit <ashutosh.dixit@intel.com>
Acked-by: Anna Schumaker <anna.schumaker@oracle.com>
Signed-off-by: Joel Granados <joel.granados@kernel.org>
2025-01-28 13:48:37 +01:00
Linus Torvalds 9629d83f05 - fix a spelling error in dm-raid
- change kzalloc to kcalloc
 
 - remove useless test in alloc_multiple_bios
 
 - disable REQ_NOWAIT for flushes
 
 - dm-transaction-manager: use red-black trees instead of linear lists
 
 - atomic writes support for dm-linear, dm-stripe and dm-mirror
 
 - dm-crypt: code cleanups and two bugfixes
 -----BEGIN PGP SIGNATURE-----
 
 iIoEABYIADIWIQRnH8MwLyZDhyYfesYTAyx9YGnhbQUCZ5dX9RQcbXBhdG9ja2FA
 cmVkaGF0LmNvbQAKCRATAyx9YGnhba9VAP97UEbvgxZU4UnysTZc+4t9eUlmWmmU
 Tf/ERJGoi/nKXQEAr//Zj5oDLBxd80hgR8iDqLeG3L/QH8vMd8IxLwWJQg8=
 =pRsj
 -----END PGP SIGNATURE-----

Merge tag 'for-6.14/dm-changes' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm

Pull device mapper updates from Mikulas Patocka:

 - fix a spelling error in dm-raid

 - change kzalloc to kcalloc

 - remove useless test in alloc_multiple_bios

 - disable REQ_NOWAIT for flushes

 - dm-transaction-manager: use red-black trees instead of linear lists

 - atomic writes support for dm-linear, dm-stripe and dm-mirror

 - dm-crypt: code cleanups and two bugfixes

* tag 'for-6.14/dm-changes' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm:
  dm-crypt: track tag_offset in convert_context
  dm-crypt: don't initialize cc_sector again
  dm-crypt: don't update io->sector after kcryptd_crypt_write_io_submit()
  dm-crypt: use bi_sector in bio when initialize integrity seed
  dm-crypt: fully initialize clone->bi_iter in crypt_alloc_buffer()
  dm-crypt: set atomic as false when calling crypt_convert() in kworker
  dm-mirror: Support atomic writes
  dm-io: Warn on creating multiple atomic write bios for a region
  dm-stripe: Enable atomic writes
  dm-linear: Enable atomic writes
  dm: Ensure cloned bio is same length for atomic write
  dm-table: atomic writes support
  dm-transaction-manager: use red-black trees instead of linear lists
  dm: disable REQ_NOWAIT for flushes
  dm: remove useless test in alloc_multiple_bios
  dm: change kzalloc to kcalloc
  dm raid: fix spelling errors in raid_ctr()
2025-01-27 17:06:42 -08:00
Yu Kuai 8d28d0ddb9 md/md-bitmap: Synchronize bitmap_get_stats() with bitmap lifetime
After commit ec6bb299c7 ("md/md-bitmap: add 'sync_size' into struct
md_bitmap_stats"), following panic is reported:

Oops: general protection fault, probably for non-canonical address
RIP: 0010:bitmap_get_stats+0x2b/0xa0
Call Trace:
 <TASK>
 md_seq_show+0x2d2/0x5b0
 seq_read_iter+0x2b9/0x470
 seq_read+0x12f/0x180
 proc_reg_read+0x57/0xb0
 vfs_read+0xf6/0x380
 ksys_read+0x6c/0xf0
 do_syscall_64+0x82/0x170
 entry_SYSCALL_64_after_hwframe+0x76/0x7e

Root cause is that bitmap_get_stats() can be called at anytime if mddev
is still there, even if bitmap is destroyed, or not fully initialized.
Deferenceing bitmap in this case can crash the kernel. Meanwhile, the
above commit start to deferencing bitmap->storage, make the problem
easier to trigger.

Fix the problem by protecting bitmap_get_stats() with bitmap_info.mutex.

Cc: stable@vger.kernel.org # v6.12+
Fixes: 32a7627cf3 ("[PATCH] md: optimised resync using Bitmap based intent logging")
Reported-and-tested-by: Harshit Mogalapalli <harshit.m.mogalapalli@oracle.com>
Closes: https://lore.kernel.org/linux-raid/ca3a91a2-50ae-4f68-b317-abd9889f3907@oracle.com/T/#m6e5086c95201135e4941fe38f9efa76daf9666c5
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Link: https://lore.kernel.org/r/20250124092055.4050195-1-yukuai1@huaweicloud.com
Signed-off-by: Song Liu <song@kernel.org>
2025-01-24 10:03:32 -08:00
Hou Tao 8b8f803776 dm-crypt: track tag_offset in convert_context
dm-crypt uses tag_offset to index the integrity metadata for each crypt
sector. When the initial crypt_convert() returns BLK_STS_DEV_RESOURCE,
dm-crypt will try to continue the crypt/decrypt procedure in a kworker.
However, it resets tag_offset as zero instead of using the tag_offset
related with current sector. It may return unexpected data when using
random IV or return unexpected integrity related error.

Fix the problem by tracking tag_offset in per-IO convert_context.
Therefore, when the crypt/decrypt procedure continues in a kworker, it
could use the next tag_offset saved in convert_context.

Fixes: 8abec36d12 ("dm crypt: do not wait for backlogged crypto request completion in softirq")
Cc: stable@vger.kernel.org
Signed-off-by: Hou Tao <houtao1@huawei.com>
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
2025-01-21 22:02:12 +01:00
Hou Tao 996c451d98 dm-crypt: don't initialize cc_sector again
For aead_recheck case, cc_sector has already been initialized in
crypt_convert_init() when trying to re-read the read. Therefore, remove
the duplicated initialization.

Signed-off-by: Hou Tao <houtao1@huawei.com>
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
2025-01-21 22:02:12 +01:00
Hou Tao 9fdbbdbbc9 dm-crypt: don't update io->sector after kcryptd_crypt_write_io_submit()
The updates of io->sector are the leftovers when dm-crypt allocated
pages for partial write request. However, since commit cf2f1abfbd
("dm crypt: don't allocate pages for a partial request"), there is no
partial request anymore.

After the introduction of write request rb-tree, the updates of
io->sectors may interfere the insertion procedure, because ->sectors of
these write requests which have already been added in the rb-tree may be
changed during the insertion of new write request.

Fix it by removing these buggy updates of io->sectors. Considering these
updates only effect the write request rb-tree, the commit which
introduces the write request rb-tree is used as the fix tag.

Fixes: b3c5fd3052 ("dm crypt: sort writes")
Cc: stable@vger.kernel.org
Signed-off-by: Hou Tao <houtao1@huawei.com>
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
2025-01-21 22:02:12 +01:00
Hou Tao 2f8c28d0d9 dm-crypt: use bi_sector in bio when initialize integrity seed
bio->bi_iter.bi_sector has already been initialized when initialize the
integrity seed in dm_crypt_integrity_io_alloc(). There is no need to
calculate it again. Therefore, use the helper bip_set_seed() to
initialize the seed and pass bi_iter.bi_sector to it instead.

Mikulas: We can't use bip_set_seed because it doesn't compile without
CONFIG_BLK_DEV_INTEGRITY.

Signed-off-by: Hou Tao <houtao1@huawei.com>
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
2025-01-21 22:00:57 +01:00
Hou Tao 7c88f7cfab dm-crypt: fully initialize clone->bi_iter in crypt_alloc_buffer()
Both kcryptd_io_read() and kcryptd_crypt_write_convert() will invoke
crypt_alloc_buffer() to allocate a new bio. Both of these two callers
initialize bi_iter.bi_sector for the new bio separatedly after
crypt_alloc_buffer() returns. However, kcryptd_crypt_write_convert()
will copy the bi_iter of the new bio into ctx.iter_out or ctx.iter_in.
Although it doesn't incur any harm now, it is better to fully initialize
bi_iter before it is used.

Therefore, initialize bi_iter.bi_sector in crypt_alloc_buffer() instead.

Signed-off-by: Hou Tao <houtao1@huawei.com>
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
2025-01-21 13:25:09 +01:00
Hou Tao 6fd2cb38c0 dm-crypt: set atomic as false when calling crypt_convert() in kworker
Both kcryptd_crypt_write_continue() and kcryptd_crypt_read_continue()
are running in the kworker context, it is OK to call cond_resched(),
Therefore, set atomic as false when invoking crypt_convert() under
kworker context.

Signed-off-by: Hou Tao <houtao1@huawei.com>
Reviewed-by: Ignat Korchagin <ignat@cloudflare.com>
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
2025-01-21 13:24:35 +01:00
Linus Torvalds 1cbfb828e0 for-6.14/block-20250118
-----BEGIN PGP SIGNATURE-----
 
 iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmeL6hoQHGF4Ym9lQGtl
 cm5lbC5kawAKCRD301j7KXHgppw2EADQV8nDgLRggZR+il4U03yKHXcQEdAX1GrB
 Erowx+dasIJuh6kp3n6qRe9QD/pRqt1DKyLvXoWF8Qfuwq85j7oDnDDYxutNYT27
 hDgrLJriJ3VeKYtTu+andHWt8P29b5h57UayInDOUJurEPA6rXyFZ5YVIti8n21K
 uDOrQXiACG3qRWS2+p2f3UNhX0MkFNFdN/lxi13WMIJtRWF5bXAP+JOgIWCID4Ze
 QuSY6rQD4dp4Q6M2erpX6tn0YZb7Hvw3rPjsd91n6jvYfTUVLH375zg8jCBpi6Wi
 Syufbb8xcTtriVPTDRNu0ekjebkc8wD8ax/h86g0z9v3Ua4DlNmsx9eXrtv6r5nu
 YXqDODOad6stI0+owFquW2vas0gHmfNSfyfGdlk2g24PMtP5Yx0V6FIEvwIeqnje
 ghgxQvBuKUsdhqakByfNnc+XvXi3+RUJek8kvMeUSUQWT1IyMQqPOOk0yp9WdyWD
 bY1f2ECP5BR1b37zYOyawewsI5xTupHUswn5a4r4qtGn3O15rGDkX98Nab5aLCnR
 rW/DvX7+wT6gW9EwrRHiwjwfNDZbsJ9Ggu3lMhtUl5GUWdk58yTiVgKaHJLnlX9/
 CKFKfyyIR1Vl8+gYIpemyFhhcoN+dCSf06ISkrg0jeS0/tYwydaAaCBPL5J4kxZA
 h3Rtbh+Pgg==
 =EXYs
 -----END PGP SIGNATURE-----

Merge tag 'for-6.14/block-20250118' of git://git.kernel.dk/linux

Pull block updates from Jens Axboe:

 - NVMe pull requests via Keith:
      - Target support for PCI-Endpoint transport (Damien)
      - TCP IO queue spreading fixes (Sagi, Chaitanya)
      - Target handling for "limited retry" flags (Guixen)
      - Poll type fix (Yongsoo)
      - Xarray storage error handling (Keisuke)
      - Host memory buffer free size fix on error (Francis)

 - MD pull requests via Song:
      - Reintroduce md-linear (Yu Kuai)
      - md-bitmap refactor and fix (Yu Kuai)
      - Replace kmap_atomic with kmap_local_page (David Reaver)

 - Quite a few queue freeze and debugfs deadlock fixes

   Ming introduced lockdep support for this in the 6.13 kernel, and it
   has (unsurprisingly) uncovered quite a few issues

 - Use const attributes for IO schedulers

 - Remove bio ioprio wrappers

 - Fixes for stacked device atomic write support

 - Refactor queue affinity helpers, in preparation for better supporting
   isolated CPUs

 - Cleanups of loop O_DIRECT handling

 - Cleanup of BLK_MQ_F_* flags

 - Add rotational support for null_blk

 - Various fixes and cleanups

* tag 'for-6.14/block-20250118' of git://git.kernel.dk/linux: (106 commits)
  block: Don't trim an atomic write
  block: Add common atomic writes enable flag
  md/md-linear: Fix a NULL vs IS_ERR() bug in linear_add()
  block: limit disk max sectors to (LLONG_MAX >> 9)
  block: Change blk_stack_atomic_writes_limits() unit_min check
  block: Ensure start sector is aligned for stacking atomic writes
  blk-mq: Move more error handling into blk_mq_submit_bio()
  block: Reorder the request allocation code in blk_mq_submit_bio()
  nvme: fix bogus kzalloc() return check in nvme_init_effects_log()
  md/md-bitmap: move bitmap_{start, end}write to md upper layer
  md/raid5: implement pers->bitmap_sector()
  md: add a new callback pers->bitmap_sector()
  md/md-bitmap: remove the last parameter for bimtap_ops->endwrite()
  md/md-bitmap: factor behind write counters out from bitmap_{start/end}write()
  md: Replace deprecated kmap_atomic() with kmap_local_page()
  md: reintroduce md-linear
  partitions: ldm: remove the initial kernel-doc notation
  blk-cgroup: rwstat: fix kernel-doc warnings in header file
  blk-cgroup: fix kernel-doc warnings in header file
  nbd: fix partial sending
  ...
2025-01-20 19:38:46 -08:00
John Garry 6845de78ae dm-mirror: Support atomic writes
Support atomic writes by setting DM_TARGET_ATOMIC_WRITES for the target
type, and also unmasking REQ_ATOMIC from the submitted bio op flags.

Signed-off-by: John Garry <john.g.garry@oracle.com>
Reviewed-by: Mike Snitzer <snitzer@kernel.org>
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
2025-01-17 22:24:11 +01:00
John Garry c6a657d968 dm-io: Warn on creating multiple atomic write bios for a region
A region should just be for a single atomic write, so warn when we are
creating many. This should not occur if request queue limits are properly
configured.

Signed-off-by: John Garry <john.g.garry@oracle.com>
Reviewed-by: Mike Snitzer <snitzer@kernel.org>
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
2025-01-17 22:24:09 +01:00
John Garry 30b88ed06f dm-stripe: Enable atomic writes
Set feature flag DM_TARGET_ATOMIC_WRITES.

Similar to md raid0, the chunk size set in stripe_io_hints() limits the
atomic write unit max.

Signed-off-by: John Garry <john.g.garry@oracle.com>
Reviewed-by: Mike Snitzer <snitzer@kernel.org>
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
2025-01-17 22:24:06 +01:00
John Garry 487d1a9cb5 dm-linear: Enable atomic writes
Set feature flag DM_TARGET_ATOMIC_WRITES.

Signed-off-by: John Garry <john.g.garry@oracle.com>
Reviewed-by: Mike Snitzer <snitzer@kernel.org>
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
2025-01-17 22:24:04 +01:00
John Garry 5f430e3408 dm: Ensure cloned bio is same length for atomic write
For an atomic write, a cloned bio must be same length as the original bio,
i.e. no splitting.

Error in case it is not.

Per-dm device queue limits should be setup to ensure that this does not
happen, but error this case as an insurance policy.

Signed-off-by: John Garry <john.g.garry@oracle.com>
Reviewed-by: Mike Snitzer <snitzer@kernel.org>
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
2025-01-17 22:24:01 +01:00
John Garry 3194e36488 dm-table: atomic writes support
Support stacking atomic write limits for DM devices.

All the pre-existing code in blk_stack_atomic_writes_limits() already takes
care of finding the aggregrate limits from the bottom devices.

Feature flag DM_TARGET_ATOMIC_WRITES is introduced so that atomic writes
can be enabled on personalities selectively. This is to ensure that atomic
writes are only enabled when verified to be working properly (for a
specific personality). In addition, it just may not make sense to enable
atomic writes on some personalities (so this flag also helps there).

Signed-off-by: John Garry <john.g.garry@oracle.com>
Reviewed-by: Mike Snitzer <snitzer@kernel.org>
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
2025-01-17 22:23:47 +01:00
Mikulas Patocka a38425935f dm-transaction-manager: use red-black trees instead of linear lists
There was reported performance degradation when the shadow map contained
too many entries [1]. The shadow map uses 256-bucket hash with linear
lists - when there are too many entries, it has quadratic complexity.

Meir Elisha proposed to add a module parameter that could configure the
size of the hash array - however, this is not ideal because users don't
know that they should increase the parameter when they get bad
performance.

This commit replaces the linear lists with rb-trees (so that there's a
hash of rb-trees), they have logarithmic complexity, so it solves the
performance degradation.

Link: https://patchwork.kernel.org/project/dm-devel/patch/20241014134944.1264991-1-meir.elisha@volumez.com/ [1]
Reported-by: Meir Elisha <meir.elisha@volumez.com>
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
2025-01-17 22:05:40 +01:00
Mikulas Patocka cd6521d03f dm: disable REQ_NOWAIT for flushes
REQ_NOWAIT for flushes cannot be easily supported by device mapper
because it may allocate multiple bios and its impossible to undo if one
of those allocations wants to wait. So, this patch disables REQ_NOWAIT
flushes in device mapper and we always return EAGAIN.

Previously, the code accepted REQ_NOWAIT flushes, but the non-blocking
execution was not guaranteed.

Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
2025-01-17 22:05:39 +01:00
Mikulas Patocka 6942348d1b dm: remove useless test in alloc_multiple_bios
The test gfp_flag & GFP_NOWAIT was always true, because both GFP_NOIO and
GFP_NOWAIT include the flag __GFP_KSWAPD_RECLAIM. Luckily, this oversight
didn't result in any harm; the loop always started with "try = 0".

This patch removes the faulty test and explicitly starts the loop with
"try = 0" (so that we don't hold the mutex in the first iteration).

Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
2025-01-17 22:05:39 +01:00
Ethan Carter Edwards c3a1f2ef72 dm: change kzalloc to kcalloc
Use 2-factor multiplication argument form kcalloc() instead
of instead of the deprecated kzalloc() [1].

[1] https://www.kernel.org/doc/html/next/process/deprecated.html#open-coded-arithmetic-in-allocator-arguments
Link: https://github.com/KSPP/linux/issues/162

Signed-off-by: Ethan Carter Edwards <ethan@ethancedwards.com>
Reviewed-by: Gustavo A. R. Silva <gustavoars@kernel.org>
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
2025-01-17 22:05:39 +01:00
liujing 193700b9b2 dm raid: fix spelling errors in raid_ctr()
Fix the respective spelling errors in raid_ctr() function.

Signed-off-by: liujing <liujing@cmss.chinamobile.com>
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
2025-01-17 22:05:39 +01:00
John Garry 6a7e17b220 block: Add common atomic writes enable flag
Currently only stacked devices need to explicitly enable atomic writes by
setting BLK_FEAT_ATOMIC_WRITES_STACKED flag.

This does not work well for device mapper stacking devices, as there many
sets of limits are stacked and what is the 'bottom' and 'top' device can
swapped. This means that BLK_FEAT_ATOMIC_WRITES_STACKED needs to be set
for many queue limits, which is messy.

Generalize enabling atomic writes enabling by ensuring that all devices
must explicitly set a flag - that includes NVMe, SCSI sd, and md raid.

Signed-off-by: John Garry <john.g.garry@oracle.com>
Reviewed-by: Mike Snitzer <snitzer@kernel.org>
Link: https://lore.kernel.org/r/20250116170301.474130-2-john.g.garry@oracle.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-01-17 13:13:54 -07:00
Dan Carpenter 62c552070a md/md-linear: Fix a NULL vs IS_ERR() bug in linear_add()
The linear_conf() returns error pointers, it doesn't return NULL.  Update
the error checking to match.

Fixes: 127186cfb1 ("md: reintroduce md-linear")
Signed-off-by: Dan Carpenter <dan.carpenter@linaro.org>
Reviewed-by: Yu Kuai <yukuai3@huawei.com>
Link: https://lore.kernel.org/r/add654be-759f-4b2d-93ba-a3726dae380c@stanley.mountain
Signed-off-by: Song Liu <song@kernel.org>
2025-01-16 09:31:25 -08:00
Yu Kuai cd5fc65338 md/md-bitmap: move bitmap_{start, end}write to md upper layer
There are two BUG reports that raid5 will hang at
bitmap_startwrite([1],[2]), root cause is that bitmap start write and end
write is unbalanced, it's not quite clear where, and while reviewing raid5
code, it's found that bitmap operations can be optimized. For example,
for a 4 disks raid5, with chunksize=8k, if user issue a IO (0 + 48k) to
the array:

┌────────────────────────────────────────────────────────────┐
│chunk 0                                                     │
│      ┌────────────┬─────────────┬─────────────┬────────────┼
│  sh0 │A0: 0 + 4k  │A1: 8k + 4k  │A2: 16k + 4k │A3: P       │
│      ┼────────────┼─────────────┼─────────────┼────────────┼
│  sh1 │B0: 4k + 4k │B1: 12k + 4k │B2: 20k + 4k │B3: P       │
┼──────┴────────────┴─────────────┴─────────────┴────────────┼
│chunk 1                                                     │
│      ┌────────────┬─────────────┬─────────────┬────────────┤
│  sh2 │C0: 24k + 4k│C1: 32k + 4k │C2: P        │C3: 40k + 4k│
│      ┼────────────┼─────────────┼─────────────┼────────────┼
│  sh3 │D0: 28k + 4k│D1: 36k + 4k │D2: P        │D3: 44k + 4k│
└──────┴────────────┴─────────────┴─────────────┴────────────┘

Before this patch, 4 stripe head will be used, and each sh will attach
bio for 3 disks, and each attached bio will trigger
bitmap_startwrite() once, which means total 12 times.
 - 3 times (0 + 4k), for (A0, A1 and A2)
 - 3 times (4 + 4k), for (B0, B1 and B2)
 - 3 times (8 + 4k), for (C0, C1 and C3)
 - 3 times (12 + 4k), for (D0, D1 and D3)

After this patch, md upper layer will calculate that IO range (0 + 48k)
is corresponding to the bitmap (0 + 16k), and call bitmap_startwrite()
just once.

Noted that this patch will align bitmap ranges to the chunks, for example,
if user issue a IO (0 + 4k) to array:

- Before this patch, 1 time (0 + 4k), for A0;
- After this patch, 1 time (0 + 8k) for chunk 0;

Usually, one bitmap bit will represent more than one disk chunk, and this
doesn't have any difference. And even if user really created a array
that one chunk contain multiple bits, the overhead is that more data
will be recovered after power failure.

Also remove STRIPE_BITMAP_PENDING since it's not used anymore.

[1] https://lore.kernel.org/all/CAJpMwyjmHQLvm6zg1cmQErttNNQPDAAXPKM3xgTjMhbfts986Q@mail.gmail.com/
[2] https://lore.kernel.org/all/ADF7D720-5764-4AF3-B68E-1845988737AA@flyingcircus.io/

Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Link: https://lore.kernel.org/r/20250109015145.158868-6-yukuai1@huaweicloud.com
Signed-off-by: Song Liu <song@kernel.org>
2025-01-13 08:56:11 -08:00
Yu Kuai 9c89f60447 md/raid5: implement pers->bitmap_sector()
Bitmap is used for the whole array for raid1/raid10, hence IO for the
array can be used directly for bitmap. However, bitmap is used for
underlying disks for raid5, hence IO for the array can't be used
directly for bitmap.

Implement pers->bitmap_sector() for raid5 to convert IO ranges from the
array to the underlying disks.

Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Link: https://lore.kernel.org/r/20250109015145.158868-5-yukuai1@huaweicloud.com
Signed-off-by: Song Liu <song@kernel.org>
2025-01-13 08:56:11 -08:00