The cited commit made fs.c always compile, even when
INFINIBAND_USER_ACCESS isn't set. This results in a compilation
warning about an unused object when compiling with W=1 and
USER_ACCESS is unset.
Fix this by defining uverbs_destroy_def_handler() even when
USER_ACCESS isn't set.
Fixes: 36e0d43367 ("RDMA/mlx5: Compile fs.c regardless of INFINIBAND_USER_ACCESS config")
Link: https://patch.msgid.link/r/20250402070944.1022093-1-mbloch@nvidia.com
Signed-off-by: Mark Bloch <mbloch@nvidia.com>
Tested-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
timer_delete[_sync]() replaces del_timer[_sync](). Convert the whole tree
over and remove the historical wrapper inlines.
Conversion was done with coccinelle plus manual fixups where necessary.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
- Usual minor updates and fixes for bnxt_re, hfi1, rxe, mana, iser, mlx5,
vmw_pvrdma, hns
- Make rxe work on tun devices
- mana gains more standard verbs as it moves toward supporting in-kernel
verbs
- DMABUF support for mana
- Fix page size calculations when memory registration exceeds 4G
- On Demand Paging support for rxe
- mlx5 support for RDMA TRANSPORT flow tables and a new ucap mechanism to
access control use of them
- Optional RDMA_TX/RX counters per QP in mlx5
-----BEGIN PGP SIGNATURE-----
iHUEABYKAB0WIQRRRCHOFoQz/8F5bUaFwuHvBreFYQUCZ+ap4gAKCRCFwuHvBreF
YaFHAP9wyeZCZIbnWaGcbNdbsmkEgy7aTVILRHf1NA7VSJ211gD9Ha60E+mkwtvA
i7IJ49R2BdqzKaO9oTutj2Lw+8rABwQ=
=qXhh
-----END PGP SIGNATURE-----
Merge tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma
Pull rdma updates from Jason Gunthorpe:
- Usual minor updates and fixes for bnxt_re, hfi1, rxe, mana, iser,
mlx5, vmw_pvrdma, hns
- Make rxe work on tun devices
- mana gains more standard verbs as it moves toward supporting
in-kernel verbs
- DMABUF support for mana
- Fix page size calculations when memory registration exceeds 4G
- On Demand Paging support for rxe
- mlx5 support for RDMA TRANSPORT flow tables and a new ucap mechanism
to access control use of them
- Optional RDMA_TX/RX counters per QP in mlx5
* tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma: (73 commits)
IB/mad: Check available slots before posting receive WRs
RDMA/mana_ib: Fix integer overflow during queue creation
RDMA/mlx5: Fix calculation of total invalidated pages
RDMA/mlx5: Fix mlx5_poll_one() cur_qp update flow
RDMA/mlx5: Fix page_size variable overflow
RDMA/mlx5: Drop access_flags from _mlx5_mr_cache_alloc()
RDMA/mlx5: Fix cache entry update on dereg error
RDMA/mlx5: Fix MR cache initialization error flow
RDMA/mlx5: Support optional-counters binding for QPs
RDMA/mlx5: Compile fs.c regardless of INFINIBAND_USER_ACCESS config
RDMA/core: Pass port to counter bind/unbind operations
RDMA/core: Add support to optional-counters binding configuration
RDMA/core: Create and destroy rdma_counter using rdma_zalloc_drv_obj()
RDMA/mlx5: Add optional counters for RDMA_TX/RX_packets/bytes
RDMA/core: Fix use-after-free when rename device name
RDMA/bnxt_re: Support perf management counters
RDMA/rxe: Fix incorrect return value of rxe_odp_atomic_op()
RDMA/uverbs: Propagate errors from rdma_lookup_get_uobject()
RDMA/mana_ib: Handle net event for pointing to the current netdev
net: mana: Change the function signature of mana_get_primary_netdev_rcu
...
When invalidating an address range in mlx5, there is an optimization to
do UMR operations in chunks.
Previously, the invalidation counter was incorrectly updated for the
same indexes within a chunk. Now, the invalidation counter is updated
only when a chunk is complete and mlx5r_umr_update_xlt() is called.
This ensures that the counter accurately represents the number of pages
invalidated using UMR.
Fixes: a3de94e3d6 ("IB/mlx5: Introduce ODP diagnostic counters")
Signed-off-by: Chiara Meiohas <cmeiohas@nvidia.com>
Reviewed-by: Michael Guralnik <michaelgur@nvidia.com>
Link: https://patch.msgid.link/560deb2433318e5947282b070c915f3c81fef77f.1741875692.git.leon@kernel.org
Signed-off-by: Leon Romanovsky <leon@kernel.org>
When cur_qp isn't NULL, in order to avoid fetching the QP from
the radix tree again we check if the next cqe QP is identical to
the one we already have.
The bug however is that we are checking if the QP is identical by
checking the QP number inside the CQE against the QP number inside the
mlx5_ib_qp, but that's wrong since the QP number from the CQE is from
FW so it should be matched against mlx5_core_qp which is our FW QP
number.
Otherwise we could use the wrong QP when handling a CQE which could
cause the kernel trace below.
This issue is mainly noticeable over QPs 0 & 1, since for now they are
the only QPs in our driver whereas the QP number inside mlx5_ib_qp
doesn't match the QP number inside mlx5_core_qp.
BUG: kernel NULL pointer dereference, address: 0000000000000012
#PF: supervisor read access in kernel mode
#PF: error_code(0x0000) - not-present page
PGD 0 P4D 0
Oops: Oops: 0000 [#1] SMP
CPU: 0 UID: 0 PID: 7927 Comm: kworker/u62:1 Not tainted 6.14.0-rc3+ #189
Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.16.3-0-ga6ed6b701f0a-prebuilt.qemu.org 04/01/2014
Workqueue: ib-comp-unb-wq ib_cq_poll_work [ib_core]
RIP: 0010:mlx5_ib_poll_cq+0x4c7/0xd90 [mlx5_ib]
Code: 03 00 00 8d 58 ff 21 cb 66 39 d3 74 39 48 c7 c7 3c 89 6e a0 0f b7 db e8 b7 d2 b3 e0 49 8b 86 60 03 00 00 48 c7 c7 4a 89 6e a0 <0f> b7 5c 98 02 e8 9f d2 b3 e0 41 0f b7 86 78 03 00 00 83 e8 01 21
RSP: 0018:ffff88810511bd60 EFLAGS: 00010046
RAX: 0000000000000010 RBX: 0000000000000000 RCX: 0000000000000000
RDX: 0000000000000000 RSI: ffff88885fa1b3c0 RDI: ffffffffa06e894a
RBP: 00000000000000b0 R08: 0000000000000000 R09: ffff88810511bc10
R10: 0000000000000001 R11: 0000000000000001 R12: ffff88810d593000
R13: ffff88810e579108 R14: ffff888105146000 R15: 00000000000000b0
FS: 0000000000000000(0000) GS:ffff88885fa00000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000000000000012 CR3: 00000001077e6001 CR4: 0000000000370eb0
Call Trace:
<TASK>
? __die+0x20/0x60
? page_fault_oops+0x150/0x3e0
? exc_page_fault+0x74/0x130
? asm_exc_page_fault+0x22/0x30
? mlx5_ib_poll_cq+0x4c7/0xd90 [mlx5_ib]
__ib_process_cq+0x5a/0x150 [ib_core]
ib_cq_poll_work+0x31/0x90 [ib_core]
process_one_work+0x169/0x320
worker_thread+0x288/0x3a0
? work_busy+0xb0/0xb0
kthread+0xd7/0x1f0
? kthreads_online_cpu+0x130/0x130
? kthreads_online_cpu+0x130/0x130
ret_from_fork+0x2d/0x50
? kthreads_online_cpu+0x130/0x130
ret_from_fork_asm+0x11/0x20
</TASK>
Fixes: e126ba97db ("mlx5: Add driver for Mellanox Connect-IB adapters")
Signed-off-by: Patrisious Haddad <phaddad@nvidia.com>
Reviewed-by: Edward Srouji <edwards@nvidia.com>
Link: https://patch.msgid.link/4ada09d41f1e36db62c44a9b25c209ea5f054316.1741875692.git.leon@kernel.org
Signed-off-by: Leon Romanovsky <leon@kernel.org>
Fix double decrement of 'in_use' counter on push_mkey_locked() failure
while deregistering an MR.
If we fail to return an mkey to the cache in cache_ent_find_and_store()
it'll update the 'in_use' counter. Its caller, revoke_mr(), also updates
it, thus having double decrement.
Wrong value of 'in_use' counter will be exposed through debugfs and can
also cause wrong resizing of the cache when users try to set cache
entry size using the 'size' debugfs.
To address this issue, the 'in_use' counter is now decremented within
mlx5_revoke_mr() also after a successful call to
cache_ent_find_and_store() and not within cache_ent_find_and_store().
Other success or failure flows remains unchanged where it was also
decremented.
Fixes: 8c1185fef6 ("RDMA/mlx5: Change check for cacheable mkeys")
Signed-off-by: Michael Guralnik <michaelgur@nvidia.com>
Reviewed-by: Yishai Hadas <yishaih@nvidia.com>
Link: https://patch.msgid.link/97e979dff636f232ff4c83ce709c17c727da1fdb.1741875692.git.leon@kernel.org
Signed-off-by: Leon Romanovsky <leon@kernel.org>
Destroy all previously created cache entries and work queue when rolling
back the MR cache initialization upon an error.
Fixes: 73d09b2fe8 ("RDMA/mlx5: Introduce mlx5r_cache_rb_key")
Signed-off-by: Michael Guralnik <michaelgur@nvidia.com>
Reviewed-by: Yishai Hadas <yishaih@nvidia.com>
Link: https://patch.msgid.link/c41d525fb3c72e28dd38511bf3aaccb5d584063e.1741875692.git.leon@kernel.org
Signed-off-by: Leon Romanovsky <leon@kernel.org>
Add support to allow optional-counters binding to a QP, whereas when
a bind operation is requested depending on the counter optional-counter
binding state the driver will determine if to also add optional-counters
to this QP binding.
The optional-counter binding is done by simply adding a steering
rule for the specific optional-counter condition with the additional
match over that QP number.
Note that optional-counters per QP rules are handled on an earlier prio
than per device counters, and per device counter correctness is maintained
by core whereas it is responsible to sum active counters when checking device
counter and to add them to history count when they are deallocated.
Signed-off-by: Patrisious Haddad <phaddad@nvidia.com>
Reviewed-by: Mark Bloch <mbloch@nvidia.com>
Link: https://patch.msgid.link/2cad1b891a6641ae61fe8d92f867e1059121813a.1741875070.git.leon@kernel.org
Signed-off-by: Leon Romanovsky <leon@kernel.org>
This will be useful for the next patches in the series since port number
is needed for optional counters binding and unbinding.
Note that this change is needed since when the operation is done qp->port
isn't necessarily initialized yet and can't be used.
Signed-off-by: Patrisious Haddad <phaddad@nvidia.com>
Reviewed-by: Mark Bloch <mbloch@nvidia.com>
Link: https://patch.msgid.link/b6f6797844acbd517358e8d2a270ea9b3e6ecba1.1741875070.git.leon@kernel.org
Signed-off-by: Leon Romanovsky <leon@kernel.org>
Change rdma_counter allocation to use rdma_zalloc_drv_obj() instead of,
explicitly allocating at core, in order to be contained inside driver
specific structures.
Adjust all drivers that use it to have their containing structure, and
add driver specific initialization operation.
This change is needed to allow upcoming patches to implement
optional-counters binding whereas inside each driver specific counter
struct his bound optional-counters will be maintained.
Signed-off-by: Patrisious Haddad <phaddad@nvidia.com>
Reviewed-by: Mark Bloch <mbloch@nvidia.com>
Link: https://patch.msgid.link/a5a484f421fc2e5595158e61a354fba43272b02d.1741875070.git.leon@kernel.org
Signed-off-by: Leon Romanovsky <leon@kernel.org>
Add the following optional counters:
rdma_tx_packets,rdma_rx_bytes,rdma_rx_packets,rdma_tx_bytes.
Which counts all RDMA packets/bytes sent and received per link.
Note that since each direction packet and byte counter are shared,
the counter is only reset when both counters of that direction
are removed. But from user-perspective each can be enabled/disabled separately.
The counters can be enabled using:
sudo rdma stat set link rocep8s0f0/1 optional-counters rdma_tx_packets
And can be seen using:
rdma stat -j show link rocep8s0f0/1
Signed-off-by: Patrisious Haddad <phaddad@nvidia.com>
Reviewed-by: Mark Bloch <mbloch@nvidia.com>
Link: https://patch.msgid.link/9f2753ad636f21704416df64b47395c8991d1123.1741875070.git.leon@kernel.org
Signed-off-by: Leon Romanovsky <leon@kernel.org>
This patch adds RDMA_TRANSPORT_RX and RDMA_TRANSPORT_TX as a new flow
table type for matcher creation.
Signed-off-by: Patrisious Haddad <phaddad@nvidia.com>
Reviewed-by: Mark Bloch <mbloch@nvidia.com>
Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
Link: https://patch.msgid.link/2287d8c50483e880450c7e8e08d9de34cdec1b14.1741261611.git.leon@kernel.org
Signed-off-by: Leon Romanovsky <leon@kernel.org>
Verify that the enabled UCAPs are supported by the device before
creating the ucontext.
If supported, create the ucontext with the associated capabilities.
Store the privileged ucontext UID on creation and remove it when
destroying the privileged ucontext. This allows the command interface
to recognize privileged commands through its UID.
Signed-off-by: Chiara Meiohas <cmeiohas@nvidia.com>
Link: https://patch.msgid.link/8b180583a207cb30deb7a2967934079749cdcc44.1741261611.git.leon@kernel.org
Reviewed-by: Yishai Hadas <yishaih@nvidia.com>
Signed-off-by: Leon Romanovsky <leon@kernel.org>
Create UCAP character devices when probing an IB device with supported
firmware capabilities.
If the RDMA_CTRL general object type is supported, check for specific
UCTX capabilities:
Create /dev/infiniband/mlx5_perm_ctrl_local for RDMA_UCAP_MLX5_CTRL_LOCAL
Create /dev/infiniband/mlx5_perm_ctrl_other_vhca for RDMA_UCAP_MLX5_CTRL_OTHER_VHCA
Signed-off-by: Chiara Meiohas <cmeiohas@nvidia.com>
Link: https://patch.msgid.link/30ed40e7a12a694cf4ee257459ed61b145b7837d.1741261611.git.leon@kernel.org
Reviewed-by: Yishai Hadas <yishaih@nvidia.com>
Signed-off-by: Leon Romanovsky <leon@kernel.org>
In function create_ib_ah() the following line attempts
to left shift the return value of mlx5r_ib_rate() by 4
and store it in the stat_rate_sl member of av:
However the code overlooks the fact that mlx5r_ib_rate()
may return -EINVAL if the rate passed to it is less than
IB_RATE_2_5_GBPS or greater than IB_RATE_800_GBPS.
Because of this, the code may invoke undefined behaviour when
shifting a signed negative value when doing "-EINVAL << 4".
To fix this check for errors before assigning stat_rate_sl and
propagate any error value to the callers.
Fixes: c534ffda78 ("RDMA/mlx5: Fix AH static rate parsing")
Signed-off-by: Qasim Ijaz <qasdev00@gmail.com>
Link: https://patch.msgid.link/20250304140246.205919-1-qasdev00@gmail.com
Reviewed-by: Patrisious Haddad <phaddad@nvidia.com>
Signed-off-by: Leon Romanovsky <leon@kernel.org>
capable() calls refer to enabled LSMs whether to permit or deny the
request. This is relevant in connection with SELinux, where a
capability check results in a policy decision and by default a denial
message on insufficient permission is issued.
It can lead to three undesired cases:
1. A denial message is generated, even in case the operation was an
unprivileged one and thus the syscall succeeded, creating noise.
2. To avoid the noise from 1. the policy writer adds a rule to ignore
those denial messages, hiding future syscalls, where the task
performs an actual privileged operation, leading to hidden limited
functionality of that task.
3. To avoid the noise from 1. the policy writer adds a rule to permit
the task the requested capability, while it does not need it,
violating the principle of least privilege.
Signed-off-by: Christian Göttsche <cgzones@googlemail.com>
Link: https://patch.msgid.link/20250302160657.127253-10-cgoettsche@seltendoof.de
Reviewed-by: Serge Hallyn <serge@hallyn.com>
Signed-off-by: Leon Romanovsky <leon@kernel.org>
When there is a failure during bind QP, the cleanup flow destroys the
counter regardless if it is the one that created it or not, which is
problematic since if it isn't the one that created it, that counter could
still be in use.
Fix that by destroying the counter only if it was created during this call.
Fixes: 45842fc627 ("IB/mlx5: Support statistic q counter configuration")
Signed-off-by: Patrisious Haddad <phaddad@nvidia.com>
Reviewed-by: Mark Zhang <markzhang@nvidia.com>
Link: https://patch.msgid.link/25dfefddb0ebefa668c32e06a94d84e3216257cf.1740033937.git.leon@kernel.org
Reviewed-by: Zhu Yanjun <yanjun.zhu@linux.dev>
Signed-off-by: Leon Romanovsky <leon@kernel.org>
Previously static rate wasn't translated according to our PRM but simply
used the 4 lower bytes.
Correctly translate static rate value passed in AH creation attribute
according to our PRM expected values.
In addition change 800GB mapping to zero, which is the PRM
specified value.
Fixes: e126ba97db ("mlx5: Add driver for Mellanox Connect-IB adapters")
Signed-off-by: Patrisious Haddad <phaddad@nvidia.com>
Reviewed-by: Maor Gottlieb <maorg@nvidia.com>
Link: https://patch.msgid.link/18ef4cc5396caf80728341eb74738cd777596f60.1739187089.git.leon@kernel.org
Signed-off-by: Leon Romanovsky <leon@kernel.org>
This patch addresses a potential race condition for a DMABUF MR that can
result in a CQE with an error on the UMR QP.
During the __mlx5_ib_dereg_mr() flow, the following sequence of calls
occurs:
mlx5_revoke_mr()
mlx5r_umr_revoke_mr()
mlx5r_umr_post_send_wait()
At this point, the lkey is freed from the hardware's perspective.
However, concurrently, mlx5_ib_dmabuf_invalidate_cb() might be triggered
by another task attempting to invalidate the MR having that freed lkey.
Since the lkey has already been freed, this can lead to a CQE error,
causing the UMR QP to enter an error state.
To resolve this race condition, the dma_resv_lock() which was hold as
part of the mlx5_ib_dmabuf_invalidate_cb() is now also acquired as part
of the mlx5_revoke_mr() scope.
Upon a successful revoke, we set umem_dmabuf->private which points to
that MR to NULL, preventing any further invalidation attempts on its
lkey.
Fixes: e6fb246cca ("RDMA/mlx5: Consolidate MR destruction to mlx5_ib_dereg_mr()")
Signed-off-by: Yishai Hadas <yishaih@nvidia.com>
Reviewed-by: Artemy Kovalyov <artemyko@mnvidia.com>
Link: https://patch.msgid.link/70617067abbfaa0c816a2544c922e7f4346def58.1738587016.git.leon@kernel.org
Signed-off-by: Leon Romanovsky <leon@kernel.org>
When a DCT QP is created on an active lag, it's dctc.port is assigned
in a round-robin way, which is from 1 to dev->lag_port. In this case
when querying this QP, we may get qp_attr.port_num > 2.
Fix this by setting qp->port when modifying a DCT QP, and read port_num
from qp->port instead of dctc.port when querying it.
Fixes: 7c4b1ab9f1 ("IB/mlx5: Add DCT RoCE LAG support")
Signed-off-by: Mark Zhang <markzhang@nvidia.com>
Reviewed-by: Maher Sanalla <msanalla@nvidia.com>
Link: https://patch.msgid.link/94c76bf0adbea997f87ffa27674e0a7118ad92a9.1737290358.git.leon@kernel.org
Signed-off-by: Leon Romanovsky <leon@kernel.org>
This patch addresses an issue in the recovery flow of the UMR QP,
ensuring tasks do not get stuck, as highlighted by the call trace [1].
During recovery, before transitioning the QP to the RESET state, the
software must wait for all outstanding WRs to complete.
Failing to do so can cause the firmware to skip sending some flushed
CQEs with errors and simply discard them upon the RESET, as per the IB
specification.
This race condition can result in lost CQEs and tasks becoming stuck.
To resolve this, the patch sends a final WR which serves only as a
barrier before moving the QP state to RESET.
Once a CQE is received for that final WR, it guarantees that no
outstanding WRs remain, making it safe to transition the QP to RESET and
subsequently back to RTS, restoring proper functionality.
Note:
For the barrier WR, we simply reuse the failed and ready WR.
Since the QP is in an error state, it will only receive
IB_WC_WR_FLUSH_ERR. However, as it serves only as a barrier we don't
care about its status.
[1]
INFO: task rdma_resource_l:1922 blocked for more than 120 seconds.
Tainted: G W 6.12.0-rc7+ #1626
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
task:rdma_resource_l state:D stack:0 pid:1922 tgid:1922 ppid:1369
flags:0x00004004
Call Trace:
<TASK>
__schedule+0x420/0xd30
schedule+0x47/0x130
schedule_timeout+0x280/0x300
? mark_held_locks+0x48/0x80
? lockdep_hardirqs_on_prepare+0xe5/0x1a0
wait_for_completion+0x75/0x130
mlx5r_umr_post_send_wait+0x3c2/0x5b0 [mlx5_ib]
? __pfx_mlx5r_umr_done+0x10/0x10 [mlx5_ib]
mlx5r_umr_revoke_mr+0x93/0xc0 [mlx5_ib]
__mlx5_ib_dereg_mr+0x299/0x520 [mlx5_ib]
? _raw_spin_unlock_irq+0x24/0x40
? wait_for_completion+0xfe/0x130
? rdma_restrack_put+0x63/0xe0 [ib_core]
ib_dereg_mr_user+0x5f/0x120 [ib_core]
? lock_release+0xc6/0x280
destroy_hw_idr_uobject+0x1d/0x60 [ib_uverbs]
uverbs_destroy_uobject+0x58/0x1d0 [ib_uverbs]
uobj_destroy+0x3f/0x70 [ib_uverbs]
ib_uverbs_cmd_verbs+0x3e4/0xbb0 [ib_uverbs]
? __pfx_uverbs_destroy_def_handler+0x10/0x10 [ib_uverbs]
? __lock_acquire+0x64e/0x2080
? mark_held_locks+0x48/0x80
? find_held_lock+0x2d/0xa0
? lock_acquire+0xc1/0x2f0
? ib_uverbs_ioctl+0xcb/0x170 [ib_uverbs]
? __fget_files+0xc3/0x1b0
ib_uverbs_ioctl+0xe7/0x170 [ib_uverbs]
? ib_uverbs_ioctl+0xcb/0x170 [ib_uverbs]
__x64_sys_ioctl+0x1b0/0xa70
do_syscall_64+0x6b/0x140
entry_SYSCALL_64_after_hwframe+0x76/0x7e
RIP: 0033:0x7f99c918b17b
RSP: 002b:00007ffc766d0468 EFLAGS: 00000246 ORIG_RAX:
0000000000000010
RAX: ffffffffffffffda RBX: 00007ffc766d0578 RCX:
00007f99c918b17b
RDX: 00007ffc766d0560 RSI: 00000000c0181b01 RDI:
0000000000000003
RBP: 00007ffc766d0540 R08: 00007f99c8f99010 R09:
000000000000bd7e
R10: 00007f99c94c1c70 R11: 0000000000000246 R12:
00007ffc766d0530
R13: 000000000000001c R14: 0000000040246a80 R15:
0000000000000000
</TASK>
Fixes: 158e71bb69 ("RDMA/mlx5: Add a umr recovery flow")
Signed-off-by: Yishai Hadas <yishaih@nvidia.com>
Reviewed-by: Michael Guralnik <michaelgur@nvidia.com>
Link: https://patch.msgid.link/27b51b92ec42dfb09d8096fcbd51878f397ce6ec.1737290141.git.leon@kernel.org
Signed-off-by: Leon Romanovsky <leon@kernel.org>
Lighter that normal, but the now usual collection of driver fixes and
small improvements:
- Small fixes and minor improvements to cxgb4, bnxt_re, rxe, srp, efa,
cxgb4
- Update mlx4 to use the new umem APIs, avoiding direct use of scatterlist
- Support ROCEv2 in erdma
- Remove various uncalled functions, constify bin_attribute
- Provide core infrastructure to catch netdev events and route them to
drivers, consolidating duplicated driver code
- Fix rare race condition crashes in mlx5 ODP flows
-----BEGIN PGP SIGNATURE-----
iHUEABYKAB0WIQRRRCHOFoQz/8F5bUaFwuHvBreFYQUCZ5JkNAAKCRCFwuHvBreF
YeCwAP9nrzFZMBEa0DVx7V8w3sotQCOzaaoi+4UOigeleppKSAD+K7QA5CIwNf7j
x0apvJlNsXKSYJVUI2rch/qIL8ckQwM=
=wIYu
-----END PGP SIGNATURE-----
Merge tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma
Pull rdma updates from Jason Gunthorpe:
"Lighter that normal, but the now usual collection of driver fixes and
small improvements:
- Small fixes and minor improvements to cxgb4, bnxt_re, rxe, srp,
efa, cxgb4
- Update mlx4 to use the new umem APIs, avoiding direct use of
scatterlist
- Support ROCEv2 in erdma
- Remove various uncalled functions, constify bin_attribute
- Provide core infrastructure to catch netdev events and route them
to drivers, consolidating duplicated driver code
- Fix rare race condition crashes in mlx5 ODP flows"
* tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma: (63 commits)
RDMA/mlx5: Fix implicit ODP use after free
RDMA/mlx5: Fix a race for an ODP MR which leads to CQE with error
RDMA/qib: Constify 'struct bin_attribute'
RDMA/hfi1: Constify 'struct bin_attribute'
RDMA/rxe: Fix the warning "__rxe_cleanup+0x12c/0x170 [rdma_rxe]"
RDMA/cxgb4: Notify rdma stack for IB_EVENT_QP_LAST_WQE_REACHED event
RDMA/bnxt_re: Allocate dev_attr information dynamically
RDMA/bnxt_re: Pass the context for ulp_irq_stop
RDMA/bnxt_re: Add support to handle DCB_CONFIG_CHANGE event
RDMA/bnxt_re: Query firmware defaults of CC params during probe
RDMA/bnxt_re: Add Async event handling support
bnxt_en: Add ULP call to notify async events
RDMA/mlx5: Fix indirect mkey ODP page count
MAINTAINERS: Update the bnxt_re maintainers
RDMA/hns: Clean up the legacy CONFIG_INFINIBAND_HNS
RDMA/rtrs: Add missing deinit() call
RDMA/efa: Align interrupt related fields to same type
RDMA/bnxt_re: Fix to drop reference to the mmap entry in case of error
RDMA/mlx5: Fix link status down event for MPV
RDMA/erdma: Support create_ah/destroy_ah in non-sleepable contexts
...
This patch addresses a race condition for an ODP MR that can result in a
CQE with an error on the UMR QP.
During the __mlx5_ib_dereg_mr() flow, the following sequence of calls
occurs:
mlx5_revoke_mr()
mlx5r_umr_revoke_mr()
mlx5r_umr_post_send_wait()
At this point, the lkey is freed from the hardware's perspective.
However, concurrently, mlx5_ib_invalidate_range() might be triggered by
another task attempting to invalidate a range for the same freed lkey.
This task will:
- Acquire the umem_odp->umem_mutex lock.
- Call mlx5r_umr_update_xlt() on the UMR QP.
- Since the lkey has already been freed, this can lead to a CQE error,
causing the UMR QP to enter an error state [1].
To resolve this race condition, the umem_odp->umem_mutex lock is now also
acquired as part of the mlx5_revoke_mr() scope. Upon successful revoke,
we set umem_odp->private which points to that MR to NULL, preventing any
further invalidation attempts on its lkey.
[1] From dmesg:
infiniband rocep8s0f0: dump_cqe:277:(pid 0): WC error: 6, Message: memory bind operation error
cqe_dump: 00000000: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
cqe_dump: 00000010: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
cqe_dump: 00000020: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
cqe_dump: 00000030: 00 00 00 00 08 00 78 06 25 00 11 b9 00 0e dd d2
WARNING: CPU: 15 PID: 1506 at drivers/infiniband/hw/mlx5/umr.c:394 mlx5r_umr_post_send_wait+0x15a/0x2b0 [mlx5_ib]
Modules linked in: ip6table_mangle ip6table_natip6table_filter ip6_tables iptable_mangle xt_conntrack xt_MASQUERADE nf_conntrack_netlink nfnetlink xt_addrtype iptable_nat nf_nat br_netfilter rpcsec_gss_krb5 auth_rpcgss oid_registry overlay rpcrdma rdma_ucm ib_iser libiscsi scsi_transport_iscsi rdma_cm iw_cm ib_umad ib_ipoib ib_cm mlx5_ib ib_uverbs ib_core fuse mlx5_core
CPU: 15 UID: 0 PID: 1506 Comm: ibv_rc_pingpong Not tainted 6.12.0-rc7+ #1626
Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.13.0-0-gf21b5a4aeb02-prebuilt.qemu.org 04/01/2014
RIP: 0010:mlx5r_umr_post_send_wait+0x15a/0x2b0 [mlx5_ib]
[..]
Call Trace:
<TASK>
mlx5r_umr_update_xlt+0x23c/0x3e0 [mlx5_ib]
mlx5_ib_invalidate_range+0x2e1/0x330 [mlx5_ib]
__mmu_notifier_invalidate_range_start+0x1e1/0x240
zap_page_range_single+0xf1/0x1a0
madvise_vma_behavior+0x677/0x6e0
do_madvise+0x1a2/0x4b0
__x64_sys_madvise+0x25/0x30
do_syscall_64+0x6b/0x140
entry_SYSCALL_64_after_hwframe+0x76/0x7e
Fixes: e6fb246cca ("RDMA/mlx5: Consolidate MR destruction to mlx5_ib_dereg_mr()")
Cc: stable@vger.kernel.org
Link: https://patch.msgid.link/r/68a1e007c25b2b8fe5d625f238cc3b63e5341f77.1737290229.git.leon@kernel.org
Signed-off-by: Yishai Hadas <yishaih@nvidia.com>
Reviewed-by: Artemy Kovalyov <artemyko@nvidia.com>
Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Restrict the check for the number of pages handled during an ODP page
fault to direct mkeys.
Perform the check right after handling the page fault and don't
propagate the number of handled pages to callers.
Indirect mkeys and their associated direct mkeys can have different
start addresses. As a result, the calculation of the number of pages to
handle for an indirect mkey may not match the actual page fault
handling done on the direct mkey.
For example:
A 4K sized page fault on a KSM mkey that has a start address that is not
aligned to a page will result a calculation that assumes the number of
pages required to handle are 2.
While the underlying MTT might be aligned will require fetching only a
single page.
Thus, do the calculation and compare number of pages handled only per
direct mkey.
Fixes: db570d7dea ("IB/mlx5: Add ODP support to MW")
Signed-off-by: Michael Guralnik <michaelgur@nvidia.com>
Reviewed-by: Artemy Kovalyov <artemyko@nvidia.com>
Link: https://patch.msgid.link/86c483d9e75ce8fe14e9ff85b62df72b779f8ab1.1736187990.git.leon@kernel.org
Signed-off-by: Leon Romanovsky <leon@kernel.org>
The commit below prevented MPV from unloading correctly due to blocking
the netdev down event, allow sending the event for MPV mode to maintain
proper unload flow.
Fixes: 3790137762 ("RDMA/mlx5: Handle link status event only for LAG device")
Signed-off-by: Patrisious Haddad <phaddad@nvidia.com>
Reviewed-by: Maor Gottlieb <maorg@nvidia.com>
Link: https://patch.msgid.link/d7731478e456f61255af798a7fd4e64b006ddebb.1735567976.git.leonro@nvidia.com
Signed-off-by: Leon Romanovsky <leon@kernel.org>
The link status events of non-LAG devices are now handled in ib_core,
so only LAG device events need to be handled in driver.
Signed-off-by: Yuyu Li <liyuyu6@huawei.com>
Signed-off-by: Junxian Huang <huangjunxian6@hisilicon.com>
Signed-off-by: Leon Romanovsky <leon@kernel.org>
Currently mlx5_flow_destination includes counter_id which is assigned in
case we use flow counter on the flow steering rule. However, counter_id
is not enough data in case of using HW Steering. Thus, have mlx5_fc
object as part of mlx5_flow_destination instead of counter_id and assign
it where needed.
In case counter_id is received from user space, create a local counter
object to represent it.
Signed-off-by: Moshe Shemesh <moshe@nvidia.com>
Reviewed-by: Yevgeny Kliteynik <kliteyn@nvidia.com>
Reviewed-by: Mark Bloch <mbloch@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Link: https://patch.msgid.link/20241219175841.1094544-4-tariqt@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
The current ODP counters represent the total number of pages
handled, but it is not enough to understand the effectiveness
of these operations.
Extend the ODP counters to include the number of times page fault
and invalidation events were handled.
Example for a single page fault handling 512 pages:
- page_fault: incremented by 512 (total pages)
- page_fault_handled: incremented by 1 (operation count)
The same example is applicable for page invalidation too.
Previous output:
$ rdma stat mr
dev rocep8s0f0 mrn 8 page_faults 27 page_invalidations 0 page_prefetch 29
New output:
$ rdma stat mr
dev rocep8s0f0 mrn 21 page_faults 512 page_faults_handled 1
page_invalidations 0 page_invalidations_handled 0 page_prefetch 51200
Signed-off-by: Chiara Meiohas <cmeiohas@nvidia.com>
Reviewed-by: Michael Guralnik <michaelgur@nvidia.com>
Link: https://patch.msgid.link/b18f29ed1392996ade66e9e6c45f018925253f6a.1733234165.git.leonro@nvidia.com
Signed-off-by: Leon Romanovsky <leon@kernel.org>
Different core device types such as PFs and VFs shouldn't be affiliated
together since they have different capabilities, fix that by enforcing
type check before doing the affiliation.
Fixes: 32f69e4be2 ("{net, IB}/mlx5: Manage port association for multiport RoCE")
Reviewed-by: Mark Bloch <mbloch@nvidia.com>
Signed-off-by: Patrisious Haddad <phaddad@nvidia.com>
Link: https://patch.msgid.link/88699500f690dff1c1852c1ddb71f8a1cc8b956e.1733233480.git.leonro@nvidia.com
Reviewed-by: Mateusz Polchlopek <mateusz.polchlopek@intel.com>
Signed-off-by: Leon Romanovsky <leon@kernel.org>
Seveal fixes scattered across the drivers and a few new features:
- Minor updates and bug fixes to hfi1, efa, iopob, bnxt, hns
- Force disassociate the userspace FD when hns does an async reset
- bnxt new features for optimized modify QP to skip certain stayes, CQ
coalescing, better debug dumping
- mlx5 new data placement ordering feature
- Faster destruction of mlx5 devx HW objects
- Improvements to RDMA CM mad handling
-----BEGIN PGP SIGNATURE-----
iHUEABYKAB0WIQRRRCHOFoQz/8F5bUaFwuHvBreFYQUCZz4ENwAKCRCFwuHvBreF
YQYQAP9R54r5J1Iylg+zqhCc+e/9oveuuZbfLvy/EJiEpmdprQEAgPs1RrB0z7U6
1xrVStUKNPhGd5XeVVZGkIV0zYv6Tw4=
=V5xI
-----END PGP SIGNATURE-----
Merge tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma
Pull rdma updates from Jason Gunthorpe:
"Seveal fixes scattered across the drivers and a few new features:
- Minor updates and bug fixes to hfi1, efa, iopob, bnxt, hns
- Force disassociate the userspace FD when hns does an async reset
- bnxt new features for optimized modify QP to skip certain stayes,
CQ coalescing, better debug dumping
- mlx5 new data placement ordering feature
- Faster destruction of mlx5 devx HW objects
- Improvements to RDMA CM mad handling"
* tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma: (51 commits)
RDMA/bnxt_re: Correct the sequence of device suspend
RDMA/bnxt_re: Use the default mode of congestion control
RDMA/bnxt_re: Support different traffic class
IB/cm: Rework sending DREQ when destroying a cm_id
IB/cm: Do not hold reference on cm_id unless needed
IB/cm: Explicitly mark if a response MAD is a retransmission
RDMA/mlx5: Move events notifier registration to be after device registration
RDMA/bnxt_re: Cache MSIx info to a local structure
RDMA/bnxt_re: Refurbish CQ to NQ hash calculation
RDMA/bnxt_re: Refactor NQ allocation
RDMA/bnxt_re: Fail probe early when not enough MSI-x vectors are reserved
RDMA/hns: Fix different dgids mapping to the same dip_idx
RDMA/bnxt_re: Add set_func_resources support for P5/P7 adapters
RDMA/bnxt_re: Enhance RoCE SRIOV resource configuration design
bnxt_en: Add support for RoCE sriov configuration
RDMA/hns: Fix NULL pointer derefernce in hns_roce_map_mr_sg()
RDMA/hns: Fix out-of-order issue of requester when setting FENCE
RDMA/nldev: Add IB device and net device rename events
RDMA/mlx5: Add implementation for ufile_hw_cleanup device operation
RDMA/core: Move ib_uverbs_file struct to uverbs_types.h
...
Implement the device API for ufile_hw_cleanup operation, which
iterates over the ufile uobjects lists, and attempts to destroy
DevX QPs, by issuing up to 8 commands in parallel.
This function is responsible only for cleaning the FW resources of the
QP, and doesn't necessarily cleanup all of its resources.
Hence the normal serialized cleanup flow is still executed after it
in __uverbs_cleanup_ufile() to cleanup the remaining resources and
handle the cleanup of SW objects.
In order to avoid double cleanup for the FW resources, new DevX flag
was added DEVX_OBJ_FLAGS_HW_FREED, which marks the object's FW resources
as already freed.
Since QP destruction is the most time-consuming operation in FW,
parallelizing it reduces the cleanup time of applications that use
DevX QPs.
Signed-off-by: Patrisious Haddad <phaddad@nvidia.com>
Link: https://patch.msgid.link/2f82675d0412542cba1c47a6b86f589521ae41e1.1730373303.git.leon@kernel.org
Signed-off-by: Leon Romanovsky <leon@kernel.org>
Fix a race condition when creating a lag bond in active backup
mode where after the bond creation the backup slave was
attached to the IB device, instead of the active slave.
This caused stale entries in the GID table, as the gid updating
mechanism relies on ib_device_get_netdev(), which would return
the backup slave.
Send an MLX5_DRIVER_EVENT_ACTIVE_BACKUP_LAG_CHANGE_LOWERSTATE
event when activating the lag, additionally to when modifying
the lag. This ensures that eventually the active netdevice is
stored in the bond IB device.
When handling this event remove the GIDs of the previously
attached netdevice in this port and rescan the GIDs of the
newly attached netdevice.
This ensures that eventually the active slave netdevice is
correctly stored in the IB device port. While there might be
a brief moment where the backup slave GIDs appear in the GID
table, it will eventually stabilize with the correct GIDs
(of the bond and the active slave).
Fixes: 8d159eb211 ("RDMA/mlx5: Use IB set_netdev and get_netdev functions")
Signed-off-by: Chiara Meiohas <cmeiohas@nvidia.com>
Link: https://patch.msgid.link/91fc2cb24f63add266a528c1c702668a80416d9f.1730381292.git.leon@kernel.org
Signed-off-by: Leon Romanovsky <leon@kernel.org>
Support QP with out-of-order (OOO) capabilities enabled.
This allows WRs on the receiver side of the QP to be consumed OOO,
permitting the sender side to transmit messages without guaranteeing
arrival order on the receiver side.
When enabled, the completion ordering of WRs remains in-order,
regardless of the Receive WRs consumption order.
RDMA Read and RDMA Atomic operations on the responder side continue to
be executed in-order, while the ordering of data placement for RDMA
Write and Send operations is not guaranteed.
Atomic operations larger than 8 bytes are currently not supported.
Therefore, when this feature is enabled, the created QP restricts its
atomic support to 8 bytes at most.
In addition, when querying the device, a new flag is returned in
response to indicate that the Kernel supports OOO QP.
Signed-off-by: Edward Srouji <edwards@nvidia.com>
Reviewed-by: Yishai Hadas <yishaih@nvidia.com>
Link: https://patch.msgid.link/06ac609a5f358c8fb0a090d22c61a2f9329d82e6.1725362773.git.leon@kernel.org
Signed-off-by: Leon Romanovsky <leon@kernel.org>
After the cited commit below max_dest_rd_atomic and max_rd_atomic values
are being rounded down to the next power of 2. As opposed to the old
behavior and mlx4 driver where they used to be rounded up instead.
In order to stay consistent with older code and other drivers, revert to
using fls round function which rounds up to the next power of 2.
Fixes: f18e26af6a ("RDMA/mlx5: Convert modify QP to use MLX5_SET macros")
Link: https://patch.msgid.link/r/d85515d6ef21a2fa8ef4c8293dce9b58df8a6297.1728550179.git.leon@kernel.org
Signed-off-by: Patrisious Haddad <phaddad@nvidia.com>
Reviewed-by: Maher Sanalla <msanalla@nvidia.com>
Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
no_llseek had been defined to NULL two years ago, in commit 868941b144
("fs: remove no_llseek")
To quote that commit,
At -rc1 we'll need do a mechanical removal of no_llseek -
git grep -l -w no_llseek | grep -v porting.rst | while read i; do
sed -i '/\<no_llseek\>/d' $i
done
would do it.
Unfortunately, that hadn't been done. Linus, could you do that now, so
that we could finally put that thing to rest? All instances are of the
form
.llseek = no_llseek,
so it's obviously safe.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The IB layer provides a common interface to store and get net
devices associated to an IB device port (ib_device_set_netdev()
and ib_device_get_netdev()).
Previously, mlx5_ib stored and managed the associated net devices
internally.
Replace internal net device management in mlx5_ib with
ib_device_set_netdev() when attaching/detaching a net device and
ib_device_get_netdev() when retrieving the net device.
Export ib_device_get_netdev().
For mlx5 representors/PFs/VFs and lag creation we replace the netdev
assignments with the IB set/get netdev functions.
In active-backup mode lag the active slave net device is stored in the
lag itself. To assure the net device stored in a lag bond IB device is
the active slave we implement the following:
- mlx5_core: when modifying the slave of a bond we send the internal driver event
MLX5_DRIVER_EVENT_ACTIVE_BACKUP_LAG_CHANGE_LOWERSTATE.
- mlx5_ib: when catching the event call ib_device_set_netdev()
This patch also ensures the correct IB events are sent in switchdev lag.
While at it, when in multiport eswitch mode, only a single IB device is
created for all ports. The said IB device will receive all netdev events
of its VFs once loaded, thus to avoid overwriting the mapping of PF IB
device to PF netdev, ignore NETDEV_REGISTER events if the ib device has
already been mapped to a netdev.
Signed-off-by: Chiara Meiohas <cmeiohas@nvidia.com>
Signed-off-by: Michael Guralnik <michaelgur@nvidia.com>
Link: https://patch.msgid.link/20240909173025.30422-6-michaelgur@nvidia.com
Signed-off-by: Leon Romanovsky <leon@kernel.org>
phys_port_cnt of the IB device must be initialized before calling
ib_device_set_netdev().
Previously, phys_port_cnt was initialized in the mlx5_ib init function.
Remove this initialization to allow setting it separately, providing
the flexibility to call ib_device_set_netdev before registering the
IB device.
Signed-off-by: Chiara Meiohas <cmeiohas@nvidia.com>
Signed-off-by: Michael Guralnik <michaelgur@nvidia.com>
Link: https://patch.msgid.link/20240909173025.30422-4-michaelgur@nvidia.com
Signed-off-by: Leon Romanovsky <leon@kernel.org>
Report the upper device's state as the RDMA port state only in RoCE LAG or
switchdev LAG.
Fixes: 27f9e0ccb6 ("net/mlx5: Lag, Add single RDMA device in multiport mode")
Signed-off-by: Mark Bloch <mbloch@nvidia.com>
Signed-off-by: Michael Guralnik <michaelgur@nvidia.com>
Link: https://patch.msgid.link/20240909173025.30422-3-michaelgur@nvidia.com
Reviewed-by: Kalesh AP <kalesh-anakkur.purayil@broadcom.com>
Signed-off-by: Leon Romanovsky <leon@kernel.org>
Check if RoCE LAG is active before calling the LAG layer for netdev.
This clarifies if LAG is active. No behavior changes with this patch.
Signed-off-by: Mark Bloch <mbloch@nvidia.com>
Signed-off-by: Michael Guralnik <michaelgur@nvidia.com>
Link: https://patch.msgid.link/20240909173025.30422-2-michaelgur@nvidia.com
Reviewed-by: Kalesh AP <kalesh-anakkur.purayil@broadcom.com>
Signed-off-by: Leon Romanovsky <leon@kernel.org>