Merge tag 'drm-next-2025-12-05' of https://gitlab.freedesktop.org/drm/kernel
Pull more drm updates from Dave Airlie:
"There was some additional intel code for color operations we wanted to
land. However I discovered I missed a pull for the xe vfio driver
which I had sorted into 6.20 in my brain, until Thomas mentioned it.
This contains the xe vfio code, a bunch of xe fixes that were waiting
and the i915 color management support. I'd like to include it as part
of keeping the two main vendors on the same page and giving a good
cross-driver experience for userspace when it starts using it.
vfio:
- add a vfio_pci variant driver for Intel Graphics SR-IOV VFs
xe/i915 display:
- add plane color management support
xe:
- Add scope-based cleanup helper for runtime PM
- vfio xe driver prerequisites and exports
- fix vfio link error
- Fix a memory leak
- Fix a 64-bit division
- vf migration fix
- LRC pause fix"
* tag 'drm-next-2025-12-05' of https://gitlab.freedesktop.org/drm/kernel: (25 commits)
drm/i915/color: Enable Plane Color Pipelines
drm/i915/color: Add 3D LUT to color pipeline
drm/i915/color: Add registers for 3D LUT
drm/i915/color: Program Plane Post CSC Registers
drm/i915/color: Program Pre-CSC registers
drm/i915/color: Add framework to program PRE/POST CSC LUT
drm/i915: Add register definitions for Plane Post CSC
drm/i915: Add register definitions for Plane Degamma
drm/i915/color: Add plane CTM callback for D12 and beyond
drm/i915/color: Preserve sign bit when int_bits is Zero
drm/i915/color: Add framework to program CSC
drm/i915/color: Create a transfer function color pipeline
drm/i915/color: Add helper to create intel colorop
drm/i915: Add intel_color_op
drm/i915/display: Add identifiers for driver specific blocks
drm/xe/pf: fix VFIO link error
drm/xe: Protect against unset LRC when pausing submissions
drm/xe/vf: Start re-emission from first unsignaled job during VF migration
drm/xe/pf: Use div_u64 when calculating GGTT profile
drm/xe: Fix memory leak when handling pagefault vma
...
Merge tag 'for-linus-iommufd' of git://git.kernel.org/pub/scm/linux/kernel/git/jgg/iommufd
Pull iommufd updates from Jason Gunthorpe:
"This is a pretty consequential cycle for iommufd, though this pull is
not too big. It is based on a shared branch with VFIO that introduces
VFIO_DEVICE_FEATURE_DMA_BUF a DMABUF exporter for VFIO device's MMIO
PCI BARs. This was a large multiple series journey over the last year
and a half.
Based on that work IOMMUFD gains support for VFIO DMABUF's in its
existing IOMMU_IOAS_MAP_FILE, which closes the last major gap to
support PCI peer to peer transfers within VMs.
In Joerg's iommu tree we have the "generic page table" work which aims
to consolidate all the duplicated page table code in every iommu
driver into a single algorithm. This will be used by iommufd to
implement unique page table operations to start adding new features
and improve performance.
In here:
- Expand IOMMU_IOAS_MAP_FILE to accept a DMABUF exported from VFIO.
This is the first step to broader DMABUF support in iommufd, right
now it only works with VFIO. This closes the last functional gap
with classic VFIO type 1 to safely support PCI peer to peer DMA by
mapping the VFIO device's MMIO into the IOMMU.
- Relax SMMUv3 restrictions on nesting domains to better support
qemu's sequence to have an identity mapping before the vSID is
established"
* tag 'for-linus-iommufd' of git://git.kernel.org/pub/scm/linux/kernel/git/jgg/iommufd:
iommu/arm-smmu-v3-iommufd: Allow attaching nested domain for GBPA cases
iommufd/selftest: Add some tests for the dmabuf flow
iommufd: Accept a DMABUF through IOMMU_IOAS_MAP_FILE
iommufd: Have iopt_map_file_pages convert the fd to a file
iommufd: Have pfn_reader process DMABUF iopt_pages
iommufd: Allow MMIO pages in a batch
iommufd: Allow a DMABUF to be revoked
iommufd: Do not map/unmap revoked DMABUFs
iommufd: Add DMABUF to iopt_pages
vfio/pci: Add vfio_pci_dma_buf_iommufd_map()
Merge tag 'vfio-v6.19-rc1' of https://github.com/awilliam/linux-vfio
Pull VFIO updates from Alex Williamson:
- Move libvfio selftest artifacts in preparation of more tightly
coupled integration with KVM selftests (David Matlack)
- Fix comment typo in mtty driver (Chu Guangqing)
- Support for new hardware revision in the hisi_acc vfio-pci variant
driver where the migration registers can now be accessed via the PF.
When enabled for this support, the full BAR can be exposed to the
user (Longfang Liu)
- Fix vfio cdev support for VF token passing, using the correct size
for the kernel structure, thereby actually allowing userspace to
provide a non-zero UUID token. Also set the match token callback for
the hisi_acc, fixing VF token support for this vfio-pci variant
driver (Raghavendra Rao Ananta)
- Introduce internal callbacks on vfio devices to simplify and
consolidate duplicate code for generating VFIO_DEVICE_GET_REGION_INFO
data, removing various ioctl intercepts with a more structured
solution (Jason Gunthorpe)
- Introduce dma-buf support for vfio-pci devices, allowing MMIO regions
to be exposed through dma-buf objects with lifecycle managed through
move operations. This enables low-level interactions such as
vfio-pci based SPDK drivers interacting directly with dma-buf capable
RDMA devices to enable peer-to-peer operations. IOMMUFD is also now
able to build upon this support to fill a long standing feature gap
versus the legacy vfio type1 IOMMU backend with an implementation of
P2P support for VM use cases that better manages the lifecycle of the
P2P mapping (Leon Romanovsky, Jason Gunthorpe, Vivek Kasireddy)
- Convert eventfd triggering for error and request signals to use RCU
mechanisms in order to avoid a 3-way lockdep reported deadlock issue
(Alex Williamson)
- Fix a 32-bit overflow introduced via dma-buf support manifesting with
large DMA buffers (Alex Mastro)
- Convert nvgrace-gpu vfio-pci variant driver to insert mappings on
fault rather than at mmap time. This conversion serves both to make
use of huge PFNMAPs and to avoid corrected RAS events during reset,
by now being subject to vfio-pci-core's use of
unmap_mapping_range(), as well as to enable a device readiness test
after reset (Ankit Agrawal)
- Refactoring of vfio selftests to support multi-device tests and split
code to provide better separation between IOMMU and device objects.
This work also enables a new test suite addition to measure parallel
device initialization latency (David Matlack)
* tag 'vfio-v6.19-rc1' of https://github.com/awilliam/linux-vfio: (65 commits)
vfio: selftests: Add vfio_pci_device_init_perf_test
vfio: selftests: Eliminate INVALID_IOVA
vfio: selftests: Split libvfio.h into separate header files
vfio: selftests: Move vfio_selftests_*() helpers into libvfio.c
vfio: selftests: Rename vfio_util.h to libvfio.h
vfio: selftests: Stop passing device for IOMMU operations
vfio: selftests: Move IOVA allocator into iova_allocator.c
vfio: selftests: Move IOMMU library code into iommu.c
vfio: selftests: Rename struct vfio_dma_region to dma_region
vfio: selftests: Upgrade driver logging to dev_err()
vfio: selftests: Prefix logs with device BDF where relevant
vfio: selftests: Eliminate overly chatty logging
vfio: selftests: Support multiple devices in the same container/iommufd
vfio: selftests: Introduce struct iommu
vfio: selftests: Rename struct vfio_iommu_mode to iommu_mode
vfio: selftests: Allow passing multiple BDFs on the command line
vfio: selftests: Split run.sh into separate scripts
vfio: selftests: Move run.sh into scripts directory
vfio/nvgrace-gpu: wait for the GPU mem to be ready
vfio/nvgrace-gpu: Inform devmem unmapped after reset
...
Merge tag 'vfs-6.19-rc1.fd_prepare.fs' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs
Pull fd prepare updates from Christian Brauner:
"This adds the FD_ADD() and FD_PREPARE() primitive. They simplify the
common pattern of get_unused_fd_flags() + create file + fd_install()
that is used extensively throughout the kernel and currently requires
cumbersome cleanup paths.
FD_ADD() - For simple cases where a file is installed immediately:
fd = FD_ADD(O_CLOEXEC, vfio_device_open_file(device));
if (fd < 0)
vfio_device_put_registration(device);
return fd;
FD_PREPARE() - For cases requiring access to the fd or file, or
additional work before publishing:
FD_PREPARE(fdf, O_CLOEXEC, sync_file->file);
if (fdf.err) {
fput(sync_file->file);
return fdf.err;
}
data.fence = fd_prepare_fd(fdf);
if (copy_to_user((void __user *)arg, &data, sizeof(data)))
return -EFAULT;
return fd_publish(fdf);
The primitives are centered around struct fd_prepare. FD_PREPARE()
encapsulates all allocation and cleanup logic and must be followed by
a call to fd_publish() which associates the fd with the file and
installs it into the caller's fdtable. If fd_publish() isn't called,
both are deallocated automatically. FD_ADD() is a shorthand that does
fd_publish() immediately and never exposes the struct to the caller.
I've implemented this in a way that it's compatible with the cleanup
infrastructure while also being usable separately. IOW, it's centered
around struct fd_prepare which is aliased to class_fd_prepare_t and so
we can make use of all the basic guard infrastructure"
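For contrast, a minimal sketch of the classic pattern these primitives
replace (hypothetical driver code; demo_get_fd() and demo_create_file()
are illustrative names):

    int demo_get_fd(struct demo_ctx *ctx)
    {
            struct file *filep;
            int fd;

            fd = get_unused_fd_flags(O_CLOEXEC);    /* reserve an fd number */
            if (fd < 0)
                    return fd;

            filep = demo_create_file(ctx);          /* create the file */
            if (IS_ERR(filep)) {
                    put_unused_fd(fd);              /* manual unwind on failure */
                    return PTR_ERR(filep);
            }

            fd_install(fd, filep);  /* publish: no unwinding past this point */
            return fd;
    }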
* tag 'vfs-6.19-rc1.fd_prepare.fs' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: (42 commits)
io_uring: convert io_create_mock_file() to FD_PREPARE()
file: convert replace_fd() to FD_PREPARE()
vfio: convert vfio_group_ioctl_get_device_fd() to FD_ADD()
tty: convert ptm_open_peer() to FD_ADD()
ntsync: convert ntsync_obj_get_fd() to FD_PREPARE()
media: convert media_request_alloc() to FD_PREPARE()
hv: convert mshv_ioctl_create_partition() to FD_ADD()
gpio: convert linehandle_create() to FD_PREPARE()
pseries: port papr_rtas_setup_file_interface() to FD_ADD()
pseries: convert papr_platform_dump_create_handle() to FD_ADD()
spufs: convert spufs_gang_open() to FD_PREPARE()
papr-hvpipe: convert papr_hvpipe_dev_create_handle() to FD_PREPARE()
spufs: convert spufs_context_open() to FD_PREPARE()
net/socket: convert __sys_accept4_file() to FD_ADD()
net/socket: convert sock_map_fd() to FD_ADD()
net/kcm: convert kcm_ioctl() to FD_PREPARE()
net/handshake: convert handshake_nl_accept_doit() to FD_PREPARE()
secretmem: convert memfd_secret() to FD_ADD()
memfd: convert memfd_create() to FD_ADD()
bpf: convert bpf_token_create() to FD_PREPARE()
...
In addition to generic VFIO PCI functionality, the driver implements
VFIO migration uAPI, allowing userspace to enable migration for Intel
Graphics SR-IOV Virtual Functions.
The driver binds to the VF device and uses the API exposed by the Xe
driver to transfer the VF migration data under the control of the PF
device.
Acked-by: Rodrigo Vivi <rodrigo.vivi@intel.com>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Reviewed-by: Alex Williamson <alex@shazbot.org>
Link: https://patch.msgid.link/20251127093934.1462188-5-michal.winiarski@intel.com
Link: https://lore.kernel.org/all/20251128125322.34edbeaf.alex@shazbot.org/
Signed-off-by: Michał Winiarski <michal.winiarski@intel.com>
(cherry picked from commit 2e38c50ae4929f0b954fee69d428db7121452867)
Signed-off-by: Thomas Hellström <thomas.hellstrom@linux.intel.com>
Speculative CPU prefetches to GPU memory before the GPU is ready
after reset can cause harmless corrected RAS events to be logged on
Grace systems. It is thus preferred that the mapping not be
re-established until the GPU is ready post reset. The GPU readiness
can be checked through BAR0 registers, similar to the check done at
device probe.
It can take several seconds for the GPU to be ready, so it is
desirable that this time overlaps as much of the VM startup as
possible to reduce the impact on VM bootup time. The GPU readiness
state is thus checked on the first fault/huge_fault request or
read/write access, which amortizes the GPU readiness time.
The first fault or read/write access checks the GPU state when the
reset_done flag is set, which denotes that the GPU has just been
reset. The memory_lock is taken across the map/access to avoid
races with GPU reset.
Also check that memory is enabled before waiting for the GPU to be
ready; otherwise the readiness check would block for 30s.
Lastly, add PM handling wrapping the read/write access.
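A condensed sketch of the fault-time flow described above (illustrative
only: nvgrace_gpu_memory_enabled(), nvgrace_gpu_wait_device_ready() and
nvgrace_gpu_pfn() are stand-in helpers, not the driver's actual names):

    static vm_fault_t nvgrace_gpu_fault_sketch(struct vm_fault *vmf)
    {
            struct nvgrace_gpu_pci_core_device *nvdev =
                    vmf->vma->vm_private_data;
            vm_fault_t ret = VM_FAULT_SIGBUS;

            /* memory_lock serializes this path against GPU reset */
            down_read(&nvdev->core_device.memory_lock);
            if (nvdev->reset_done) {
                    /* only wait when memory is enabled, otherwise the
                     * readiness poll would block for up to 30s */
                    if (!nvgrace_gpu_memory_enabled(nvdev) ||
                        !nvgrace_gpu_wait_device_ready(nvdev))
                            goto out;               /* SIGBUS: not ready */
                    nvdev->reset_done = false;      /* ready again */
            }
            ret = vmf_insert_pfn(vmf->vma, vmf->address,
                                 nvgrace_gpu_pfn(nvdev, vmf->pgoff));
    out:
            up_read(&nvdev->core_device.memory_lock);
            return ret;
    }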
Cc: Shameer Kolothum <skolothumtho@nvidia.com>
Cc: Alex Williamson <alex@shazbot.org>
Cc: Jason Gunthorpe <jgg@ziepe.ca>
Cc: Vikram Sethi <vsethi@nvidia.com>
Reviewed-by: Shameer Kolothum <skolothumtho@nvidia.com>
Suggested-by: Alex Williamson <alex@shazbot.org>
Signed-off-by: Ankit Agrawal <ankita@nvidia.com>
Link: https://lore.kernel.org/r/20251127170632.3477-7-ankita@nvidia.com
Signed-off-by: Alex Williamson <alex@shazbot.org>
Introduce a new flag reset_done to notify that the GPU has just
been reset and the mapping to the GPU memory is zapped.
Implement the reset_done handler to set this new variable. It
will be used in later patches to wait for the GPU memory
to be ready before doing any mapping or access.
Cc: Jason Gunthorpe <jgg@ziepe.ca>
Reviewed-by: Shameer Kolothum <skolothumtho@nvidia.com>
Suggested-by: Alex Williamson <alex@shazbot.org>
Signed-off-by: Ankit Agrawal <ankita@nvidia.com>
Link: https://lore.kernel.org/r/20251127170632.3477-6-ankita@nvidia.com
Signed-off-by: Alex Williamson <alex@shazbot.org>
Split out the function that checks for the GPU device being ready
at probe.
Move the code that waits for the GPU to be ready through BAR0
register reads into a separate function, so it can be reused.
This also fixes a bug where the return status in case of timeout
gets overridden by the return from pci_enable_device. With the fix,
a timeout generates an error as originally intended.
Fixes: d85f69d520 ("vfio/nvgrace-gpu: Check the HBM training and C2C link status")
Reviewed-by: Zhi Wang <zhiw@nvidia.com>
Reviewed-by: Shameer Kolothum <skolothumtho@nvidia.com>
Signed-off-by: Ankit Agrawal <ankita@nvidia.com>
Link: https://lore.kernel.org/r/20251127170632.3477-5-ankita@nvidia.com
Signed-off-by: Alex Williamson <alex@shazbot.org>
Remove code duplication in vfio_pci_core_mmap by calling
vfio_pci_core_setup_barmap to perform the BAR mapping.
No functional change is intended.
Cc: Donald Dutile <ddutile@redhat.com>
Reviewed-by: Shameer Kolothum <skolothumtho@nvidia.com>
Reviewed-by: Zhi Wang <zhiw@nvidia.com>
Suggested-by: Alex Williamson <alex@shazbot.org>
Signed-off-by: Ankit Agrawal <ankita@nvidia.com>
Link: https://lore.kernel.org/r/20251127170632.3477-4-ankita@nvidia.com
Signed-off-by: Alex Williamson <alex@shazbot.org>
NVIDIA's Grace based systems have large device memory. The device
memory is mapped as VM_PFNMAP in the VMM VMA. The nvgrace-gpu
module could make use of the huge PFNMAP support added in mm [1].
To make use of the huge pfnmap support, a fault/huge_fault ops based
mapping mechanism needs to be implemented. Currently the nvgrace-gpu
module relies on remap_pfn_range to do the mapping during VM bootup.
Replace it to instead rely on fault and use vfio_pci_vmf_insert_pfn
to set up the mapping.
Moreover, to enable huge pfnmap, the nvgrace-gpu module is updated by
adding a huge_fault ops implementation. The implementation establishes
the mapping according to the requested order. Note that if the PFN or
the VMA address is unaligned to the order, the mapping falls back to
the PTE level.
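A rough sketch of the order handling described above (the signature of
vfio_pci_vmf_insert_pfn() and the base-PFN helper are assumed, not
taken from the actual patch):

    static vm_fault_t nvgrace_gpu_huge_fault_sketch(struct vm_fault *vmf,
                                                    unsigned int order)
    {
            unsigned long pfn = nvgrace_gpu_pfn(vmf);       /* hypothetical */

            /* PFN and VMA address must both be aligned to the order;
             * otherwise fall back so the fault is retried at PTE level */
            if (order &&
                (!IS_ALIGNED(vmf->address, PAGE_SIZE << order) ||
                 !IS_ALIGNED(pfn, 1UL << order)))
                    return VM_FAULT_FALLBACK;

            /* the helper exported by the previous patch performs the
             * actual PTE/PMD/PUD insertion */
            return vfio_pci_vmf_insert_pfn(vmf, pfn, order);
    }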
Link: https://lore.kernel.org/all/20240826204353.2228736-1-peterx@redhat.com/ [1]
Cc: Shameer Kolothum <skolothumtho@nvidia.com>
Cc: Alex Williamson <alex@shazbot.org>
Cc: Jason Gunthorpe <jgg@ziepe.ca>
Cc: Vikram Sethi <vsethi@nvidia.com>
Reviewed-by: Zhi Wang <zhiw@nvidia.com>
Reviewed-by: Shameer Kolothum <skolothumtho@nvidia.com>
Signed-off-by: Ankit Agrawal <ankita@nvidia.com>
Link: https://lore.kernel.org/r/20251127170632.3477-3-ankita@nvidia.com
Signed-off-by: Alex Williamson <alex@shazbot.org>
Refactor vfio_pci_mmap_huge_fault to split out the implementation
that maps the VMA at the PTE/PMD/PUD level into a separate function.
Export the new function to be used by the nvgrace-gpu module.
Move the alignment check, which verifies that the pfn and VMA VA are
aligned to the page order, to the header file and make it inline.
No functional change is intended.
Cc: Shameer Kolothum <skolothumtho@nvidia.com>
Cc: Alex Williamson <alex@shazbot.org>
Cc: Jason Gunthorpe <jgg@ziepe.ca>
Reviewed-by: Shameer Kolothum <skolothumtho@nvidia.com>
Signed-off-by: Ankit Agrawal <ankita@nvidia.com>
Link: https://lore.kernel.org/r/20251127170632.3477-2-ankita@nvidia.com
Signed-off-by: Alex Williamson <alex@shazbot.org>
Thanks to a device generating an ACS violation during bus reset,
lockdep reported the following circular locking issue:
CPU0: SET_IRQS (MSI/X): holds igate, acquires memory_lock
CPU1: HOT_RESET: holds memory_lock, acquires pci_bus_sem
CPU2: AER: holds pci_bus_sem, acquires igate
This results in a potential 3-way deadlock.
Remove the pci_bus_sem->igate leg of the triangle by using RCU
to peek at the eventfd rather than locking it with igate.
Fixes: 3be3a074cf ("vfio-pci: Don't use device_lock around AER interrupt setup")
Signed-off-by: Alex Williamson <alex.williamson@nvidia.com>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Link: https://lore.kernel.org/r/20251124223623.2770706-1-alex@shazbot.org
Signed-off-by: Alex Williamson <alex@shazbot.org>
This function is used to establish the "private interconnect" between the
VFIO DMABUF exporter and the iommufd DMABUF importer. This is intended to
be a temporary API until the core DMABUF interface is improved to natively
support a private interconnect and revocable negotiation.
This function should only be called by iommufd when trying to map a
DMABUF. For now iommufd will only support VFIO DMABUFs.
The following improvements are needed in the DMABUF API to generically
support more exporters with iommufd/kvm type importers that cannot use the
DMA API:
1) Revoke semantics. VFIO needs to be able to prevent access to the MMIO
during FLR, and so it will use dma_buf_move_notify() to prevent
access. iommufd does not support fault handling so it cannot
implement the full move_notify. Instead, if revoke is negotiated, the
exporter promises not to use move_notify() unless the importer can
experience failures. iommufd will unmap the dmabuf from the iommu page
tables while it is revoked.
2) Private interconnect negotiation. iommufd will only be able to map
a "private interconnect" that provides a phys_addr_t and a
struct p2pdma_provider * to describe the memory. It cannot use a DMA
mapped scatterlist since it is directly calling iommu_map().
3) NULL device during dma_buf_dynamic_attach(). Since iommufd doesn't use
the DMA API it doesn't have a DMAable struct device to pass here.
Link: https://patch.msgid.link/r/1-v2-b2c110338e3f+5c2-iommufd_dmabuf_jgg@nvidia.com
Reviewed-by: Nicolin Chen <nicolinc@nvidia.com>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Tested-by: Nicolin Chen <nicolinc@nvidia.com>
Tested-by: Shuai Xue <xueshuai@linux.alibaba.com>
Acked-by: Alex Williamson <alex@shazbot.org>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Call vfio_pci_core_fill_phys_vec() with the proper physical ranges for the
synthetic BAR 2 and BAR 4 regions. Otherwise use the normal flow based on
the PCI BAR.
This demonstrates a DMABUF that follows the region info report to only
allow mapping parts of the region that are mmapable. Since the BAR is
power-of-two sized and the "CXL" region is just page aligned, there can
be a padding region at the end that is not mmapped or passed into the
DMABUF.
The "CXL" ranges that are remapped into BAR 2 and BAR 4 areas are not PCI
MMIO, they actually run over the CXL-like coherent interconnect and for
the purposes of DMA behave identically to DRAM. For now we don't try to
model this distinction, within the p2p framework, between true PCI BAR
memory that takes a real PCI path and the "CXL" memory that takes a
different path.
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Tested-by: Alex Mastro <amastro@fb.com>
Tested-by: Nicolin Chen <nicolinc@nvidia.com>
Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
Acked-by: Ankit Agrawal <ankita@nvidia.com>
Reviewed-by: Ankit Agrawal <ankita@nvidia.com>
Link: https://lore.kernel.org/r/20251120-dmabuf-vfio-v9-11-d7f71607f371@nvidia.com
Signed-off-by: Alex Williamson <alex@shazbot.org>
Add support for exporting PCI device MMIO regions through dma-buf,
enabling safe sharing of non-struct page memory with controlled
lifetime management. This allows RDMA and other subsystems to import
dma-buf FDs and build them into memory regions for PCI P2P operations.
The implementation provides a revocable attachment mechanism using
dma-buf move operations. MMIO regions are normally pinned as BARs
don't change physical addresses, but access is revoked when the VFIO
device is closed or a PCI reset is issued. This ensures kernel
self-defense against potentially hostile userspace.
Currently VFIO can take MMIO regions from the device's BAR and map
them into a PFNMAP VMA with special PTEs. This mapping type ensures
the memory cannot be used with things like pin_user_pages(), hmm, and
so on. In practice only the user process CPU and KVM can safely make
use of these VMA. When VFIO shuts down these VMAs are cleaned by
unmap_mapping_range() to prevent any UAF of the MMIO beyond driver
unbind.
However, VFIO type 1 has an insecure behavior where it uses
follow_pfnmap_*() to fish a MMIO PFN out of a VMA and program it back
into the IOMMU. This has a long history of enabling P2P DMA inside
VMs, but has serious lifetime problems by allowing a UAF of the MMIO
after the VFIO driver has been unbound.
Introduce DMABUF as a new safe way to export a FD based handle for the
MMIO regions. This can be consumed by existing DMABUF importers like
RDMA or DRM without opening a UAF. A following series will add an
importer to iommufd to obsolete the type 1 code and allow safe
UAF-free MMIO P2P in VM cases.
DMABUF has a built in synchronous invalidation mechanism called
move_notify. VFIO keeps track of all drivers importing its MMIO and
can invoke a synchronous invalidation callback to tell the importing
drivers to DMA unmap and forget about the MMIO pfns. This process is
called revoke. This synchronous invalidation fully prevents any
lifecycle problems. VFIO will do this before unbinding its driver
ensuring there is no UAF of the MMIO beyond the driver lifecycle.
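The revoke operation itself is small; a sketch under the dma-buf
locking rules (dma_buf_move_notify() requires the reservation lock
held; the priv layout here is illustrative):

    static void vfio_pci_dma_buf_revoke_sketch(struct dma_buf *dmabuf)
    {
            struct vfio_pci_dma_buf *priv = dmabuf->priv;   /* illustrative */

            dma_resv_lock(dmabuf->resv, NULL);
            priv->revoked = true;
            /* synchronously tell every importer to DMA unmap and forget
             * the MMIO PFNs before the reset/unbind proceeds */
            dma_buf_move_notify(dmabuf);
            dma_resv_unlock(dmabuf->resv);
    }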
Further, VFIO has additional behavior to block access to the MMIO
during things like Function Level Reset. This is because some poor
platforms may experience a MCE type crash when touching MMIO of a PCI
device that is undergoing a reset. Today this is done by using
unmap_mapping_range() on the VMAs. Extend that into the DMABUF world
and temporarily revoke the MMIO from the DMABUF importers during FLR
as well. This will more robustly prevent an errant P2P from possibly
upsetting the platform.
A DMABUF FD is a preferred handle for MMIO compared to using something
like a pgmap because:
- VFIO is supported, including its P2P feature, on archs that don't
support pgmap
- PCI devices have all sorts of BAR sizes, including ones smaller
than a section so a pgmap cannot always be created
- It is undesirable to waste a lot of memory for struct pages,
especially for a case like a GPU with ~100GB of BAR size
- We want a synchronous revoke semantic to support FLR with light
hardware requirements
Use the P2P subsystem to help generate the DMA mapping. This is a
significant upgrade over the abuse of dma_map_resource() that has
historically been used by DMABUF exporters. Experience with an OOT
version of this patch shows that real systems do need this. This
approach deals with all the P2P scenarios:
- Non-zero PCI bus_offset
- ACS flags routing traffic to the IOMMU
- ACS flags that bypass the IOMMU - though vfio noiommu is required
to hit this.
There will be further work to formalize the revoke semantic in
DMABUF. For now this acts like a move_notify dynamic exporter where
importer fault handling will get a failure when they attempt to map.
This means that only fully restartable fault capable importers can
import the VFIO DMABUFs. A future revoke semantic should open this up
to more HW as the HW only needs to invalidate, not handle restartable
faults.
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Signed-off-by: Vivek Kasireddy <vivek.kasireddy@intel.com>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
Acked-by: Ankit Agrawal <ankita@nvidia.com>
Link: https://lore.kernel.org/r/20251120-dmabuf-vfio-v9-10-d7f71607f371@nvidia.com
Signed-off-by: Alex Williamson <alex@shazbot.org>
Make sure that all VFIO PCI devices have peer-to-peer capabilities
enabled, so that their MMIO memory can be exported through DMABUF.
VFIO has always supported P2P mappings with itself. VFIO type 1
insecurely reads PFNs directly out of a VMA's PTEs and programs them
into the IOMMU allowing any two VFIO devices to perform P2P to each
other.
All existing VMMs use this capability to export P2P into a VM where
the VM could setup any kind of DMA it likes. Projects like DPDK/SPDK
are also known to make use of this, though less frequently.
As a first step to more properly integrating VFIO with the P2P
subsystem unconditionally enable P2P support for VFIO PCI devices. The
struct p2pdma_provider will act as a handle to the P2P subsystem to
do things like DMA mapping.
While real PCI devices have to support P2P (they can't even tell if an
IOVA is P2P or not) there may be fake PCI devices that may trigger
some kind of catastrophic system failure. To date VFIO has never
tripped up on such a case, but if one is discovered the plan is to add
a PCI quirk and have pcim_p2pdma_init() fail. This will fully block
the broken device throughout any users of the P2P subsystem in the
kernel.
Thus P2P through DMABUF will follow the historical VFIO model and be
unconditionally enabled by vfio-pci.
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Tested-by: Alex Mastro <amastro@fb.com>
Tested-by: Nicolin Chen <nicolinc@nvidia.com>
Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
Acked-by: Ankit Agrawal <ankita@nvidia.com>
Link: https://lore.kernel.org/r/20251120-dmabuf-vfio-v9-9-d7f71607f371@nvidia.com
Signed-off-by: Alex Williamson <alex@shazbot.org>
There is no need to share the main device pointer (struct vfio_device *)
with all the feature functions as they only need the core device
pointer. Therefore, extract the core device pointer once in the
caller (vfio_pci_core_ioctl_feature) and share it instead.
Signed-off-by: Vivek Kasireddy <vivek.kasireddy@intel.com>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Tested-by: Alex Mastro <amastro@fb.com>
Tested-by: Nicolin Chen <nicolinc@nvidia.com>
Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
Acked-by: Ankit Agrawal <ankita@nvidia.com>
Link: https://lore.kernel.org/r/20251120-dmabuf-vfio-v9-8-d7f71607f371@nvidia.com
Signed-off-by: Alex Williamson <alex@shazbot.org>
These helpers are useful for managing additional references taken
on the device from other associated VFIO modules.
Original-patch-by: Jason Gunthorpe <jgg@nvidia.com>
Signed-off-by: Vivek Kasireddy <vivek.kasireddy@intel.com>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Tested-by: Alex Mastro <amastro@fb.com>
Tested-by: Nicolin Chen <nicolinc@nvidia.com>
Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
Acked-by: Ankit Agrawal <ankita@nvidia.com>
Link: https://lore.kernel.org/r/20251120-dmabuf-vfio-v9-7-d7f71607f371@nvidia.com
Signed-off-by: Alex Williamson <alex@shazbot.org>
Remove the duplicate code and change info to a pointer. caps are not used.
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Acked-by: Pranjal Shrivastava <praan@google.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Link: https://lore.kernel.org/r/21-v2-2a9e24d62f1b+e10a-vfio_get_region_info_op_jgg@nvidia.com
Signed-off-by: Alex Williamson <alex@shazbot.org>
Remove the duplicate code and change info to a pointer. caps are not used.
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Reviewed-by: Mostafa Saleh <smostafa@google.com>
Reviewed-by: Pranjal Shrivastava <praan@google.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Link: https://lore.kernel.org/r/20-v2-2a9e24d62f1b+e10a-vfio_get_region_info_op_jgg@nvidia.com
Signed-off-by: Alex Williamson <alex@shazbot.org>
Since the core function signature changes it has to flow up to all
drivers.
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Reviewed-by: Pranjal Shrivastava <praan@google.com>
Reviewed-by: Brett Creeley <brett.creeley@amd.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Link: https://lore.kernel.org/r/19-v2-2a9e24d62f1b+e10a-vfio_get_region_info_op_jgg@nvidia.com
Signed-off-by: Alex Williamson <alex@shazbot.org>
This op does the copy to/from user for the info and can return back
a cap chain through a vfio_info_cap * result.
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Reviewed-by: Pranjal Shrivastava <praan@google.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Link: https://lore.kernel.org/r/15-v2-2a9e24d62f1b+e10a-vfio_get_region_info_op_jgg@nvidia.com
Signed-off-by: Alex Williamson <alex@shazbot.org>
Remove the fallback through the ioctl callback, no drivers use this now.
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Reviewed-by: Pranjal Shrivastava <praan@google.com>
Reviewed-by: Mostafa Saleh <smostafa@google.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Link: https://lore.kernel.org/r/14-v2-2a9e24d62f1b+e10a-vfio_get_region_info_op_jgg@nvidia.com
Signed-off-by: Alex Williamson <alex@shazbot.org>
Change the signature of vfio_cdx_ioctl_get_region_info() and hook it to
the op.
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Reviewed-by: Pranjal Shrivastava <praan@google.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Link: https://lore.kernel.org/r/11-v2-2a9e24d62f1b+e10a-vfio_get_region_info_op_jgg@nvidia.com
Signed-off-by: Alex Williamson <alex@shazbot.org>
Move it out of vfio_platform_ioctl() and re-indent it. Add it to all
platform drivers.
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Reviewed-by: Pranjal Shrivastava <praan@google.com>
Reviewed-by: Mostafa Saleh <smostafa@google.com>
Reviewed-by: Eric Auger <eric.auger@redhat.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Link: https://lore.kernel.org/r/9-v2-2a9e24d62f1b+e10a-vfio_get_region_info_op_jgg@nvidia.com
Signed-off-by: Alex Williamson <alex@shazbot.org>
Now that every variant driver provides a get_region_info op remove the
ioctl based dispatch from vfio_pci_core_ioctl().
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Reviewed-by: Pranjal Shrivastava <praan@google.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Link: https://lore.kernel.org/r/5-v2-2a9e24d62f1b+e10a-vfio_get_region_info_op_jgg@nvidia.com
Signed-off-by: Alex Williamson <alex@shazbot.org>
Remove virtiovf_vfio_pci_core_ioctl() and change the signature of
virtiovf_pci_ioctl_get_region_info().
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Reviewed-by: Pranjal Shrivastava <praan@google.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Link: https://lore.kernel.org/r/3-v2-2a9e24d62f1b+e10a-vfio_get_region_info_op_jgg@nvidia.com
Signed-off-by: Alex Williamson <alex@shazbot.org>
Change the function signature of hisi_acc_vfio_pci_ioctl()
and re-indent it.
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Acked-by: Pranjal Shrivastava <praan@google.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Link: https://lore.kernel.org/r/2-v2-2a9e24d62f1b+e10a-vfio_get_region_info_op_jgg@nvidia.com
Signed-off-by: Alex Williamson <alex@shazbot.org>
Instead of hooking the general ioctl op, have the core code directly
decode VFIO_DEVICE_GET_REGION_INFO and call an op just for it.
This is intended to allow mechanical changes to the drivers to pull their
VFIO_DEVICE_GET_REGION_INFO into a function. Later patches will improve
the function signature to consolidate more code.
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Reviewed-by: Pranjal Shrivastava <praan@google.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Link: https://lore.kernel.org/r/1-v2-2a9e24d62f1b+e10a-vfio_get_region_info_op_jgg@nvidia.com
Signed-off-by: Alex Williamson <alex@shazbot.org>
Commit 86624ba3b522 ("vfio/pci: Do vf_token checks for
VFIO_DEVICE_BIND_IOMMUFD") accidentally omitted the .match_token_uuid
callback from the hisi_acc_vfio_pci_migrn_ops struct.
Introduce the missing callback here.
Fixes: 86624ba3b5 ("vfio/pci: Do vf_token checks for VFIO_DEVICE_BIND_IOMMUFD")
Cc: stable@vger.kernel.org
Suggested-by: Longfang Liu <liulongfang@huawei.com>
Signed-off-by: Raghavendra Rao Ananta <rananta@google.com>
Reviewed-by: Longfang Liu <liulongfang@huawei.com>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Link: https://lore.kernel.org/r/20251031170603.2260022-3-rananta@google.com
Signed-off-by: Alex Williamson <alex@shazbot.org>
For the cases where the user includes a non-zero value in the
'token_uuid_ptr' field of 'struct vfio_device_bind_iommufd', the
copy_struct_from_user() in vfio_df_ioctl_bind_iommufd() fails with
-E2BIG. For the 'minsz' passed, copy_struct_from_user() expects the
newly introduced field to be zeroed, which is incorrect in this case.
Fix this by passing the actual size of the kernel struct. If working
with a newer userspace, copy_struct_from_user() would copy the
'token_uuid_ptr' field, and if working with an old userspace, it would
zero out this field, thus still retaining backward compatibility.
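The distinction is visible in the copy_struct_from_user() arguments; a
simplified sketch (minsz/usize as in vfio_df_ioctl_bind_iommufd(),
surrounding code elided):

    struct vfio_device_bind_iommufd bind;

    /* before: ksize == minsz, so a newer userspace that sets
     * token_uuid_ptr fails the trailing-bytes-must-be-zero check
     * and gets -E2BIG */
    ret = copy_struct_from_user(&bind, minsz, arg, usize);

    /* after: ksize == sizeof(bind); the new field is copied from a
     * newer userspace and zero-filled for an older one */
    ret = copy_struct_from_user(&bind, sizeof(bind), arg, usize);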
Fixes: 86624ba3b5 ("vfio/pci: Do vf_token checks for VFIO_DEVICE_BIND_IOMMUFD")
Cc: stable@vger.kernel.org
Signed-off-by: Raghavendra Rao Ananta <rananta@google.com>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Link: https://lore.kernel.org/r/20251031170603.2260022-2-rananta@google.com
Signed-off-by: Alex Williamson <alex@shazbot.org>
On new platforms greater than QM_HW_V3, the migration region has been
relocated from the VF to the PF. The VF's own configuration space is
restored to the complete 64KB, and there is no need to divide the
size of the BAR configuration space equally. The driver should be
modified accordingly to adapt to the new hardware device.
On the older hardware platform QM_HW_V3, the live migration configuration
region is placed in the latter 32K portion of the VF's BAR2 configuration
space. On the new hardware platform QM_HW_V4, the live migration
configuration region also exists in the same 32K area immediately following
the VF's BAR2, just like on QM_HW_V3.
However, access to this region is now controlled by hardware. Additionally,
a copy of the live migration configuration region is present in the PF's
BAR2 configuration space. On the new hardware platform QM_HW_V4, when an
older version of the driver is loaded, it behaves like QM_HW_V3 and uses
the configuration region in the VF, ensuring that the live migration
function continues to work normally. When the new version of the driver is
loaded, it directly uses the configuration region in the PF. Meanwhile,
hardware configuration disables the live migration configuration region
in the VF's BAR2: reads return all 0xF values, and writes are silently
ignored.
Signed-off-by: Longfang Liu <liulongfang@huawei.com>
Reviewed-by: Shameer Kolothum <shameerkolothum@gmail.com>
Link: https://lore.kernel.org/r/20251030015744.131771-3-liulongfang@huawei.com
Signed-off-by: Alex Williamson <alex@shazbot.org>
Before this commit, it was possible to create end of address space
mappings, but unmapping them via VFIO_IOMMU_UNMAP_DMA, replaying them
for newly added iommu domains, and querying their dirty pages via
VFIO_DMA_UNMAP_FLAG_GET_DIRTY_BITMAP was broken due to bugs caused by
comparisons against (iova + size) expressions, which overflow to zero.
Additionally, there appears to be a page pinning leak in the
vfio_iommu_type1_release() path, since vfio_unmap_unpin()'s loop body
where unmap_unpin_*() are called will never be entered due to overflow
of (iova + size) to zero.
This commit handles DMA map/unmap operations up to the addressable
limit by comparing against inclusive end-of-range limits, and changing
iteration to perform relative traversals across range sizes, rather than
absolute traversals across addresses.
vfio_link_dma() inserts a zero-sized vfio_dma into the rb-tree, and is
only used for that purpose, so discard the size from consideration for
the insertion point.
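The overflow is easiest to see at the top of the address space; a
worked sketch (values illustrative):

    dma_addr_t addr = 0xfffffffffffff800ULL;    /* an address in range */
    dma_addr_t iova = 0xfffffffffffff000ULL;    /* last 4K page */
    size_t size = 0x1000;

    /* before: the exclusive end wraps to 0, so range checks never
     * match and the unmap/unpin loop bodies never run */
    bool hit_old = addr < iova + size;  /* iova + size == 0 -> false */

    /* after: an inclusive end cannot wrap */
    dma_addr_t end = iova + size - 1;   /* 0xffffffffffffffff */
    bool hit_new = addr <= end;         /* true */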
Tested-by: Alejandro Jimenez <alejandro.j.jimenez@oracle.com>
Fixes: 73fa0d10d0 ("vfio: Type1 IOMMU implementation")
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Reviewed-by: Alejandro Jimenez <alejandro.j.jimenez@oracle.com>
Signed-off-by: Alex Mastro <amastro@fb.com>
Link: https://lore.kernel.org/r/20251028-fix-unmap-v6-3-2542b96bcc8e@fb.com
Signed-off-by: Alex Williamson <alex@shazbot.org>
Move incrementing iova to the caller of these functions as part of
preparing to handle end of address space map/unmap.
Tested-by: Alejandro Jimenez <alejandro.j.jimenez@oracle.com>
Fixes: 73fa0d10d0 ("vfio: Type1 IOMMU implementation")
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Reviewed-by: Alejandro Jimenez <alejandro.j.jimenez@oracle.com>
Signed-off-by: Alex Mastro <amastro@fb.com>
Link: https://lore.kernel.org/r/20251028-fix-unmap-v6-2-2542b96bcc8e@fb.com
Signed-off-by: Alex Williamson <alex@shazbot.org>
A debugfs directory was recently added for VFIO devices. Add a new
"features" file under the migration sub-directory to expose which
features the device supports.
Signed-off-by: Cédric Le Goater <clg@redhat.com>
Link: https://lore.kernel.org/r/20250918121928.1921871-1-clg@redhat.com
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
When vfio_unpin_pages_remote() is called with a range of addresses that
includes large folios, the function currently performs individual
put_pfn() operations for each page. This can lead to significant
performance overheads, especially when dealing with large ranges of pages.
It would be very rare for reserved and non-reserved PFNs to be mixed
within the same range, so this patch utilizes the has_rsvd variable
introduced in the previous patch to determine whether batch put_pfn()
operations can be performed. Moreover, compared to put_pfn(),
unpin_user_page_range_dirty_lock() is capable of handling large folio
scenarios more efficiently.
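In outline, the change replaces the per-page loop with one ranged call
whenever the vfio_dma saw no reserved PFNs (a sketch; type1 internals
elided):

    /* before: one put_pfn() per page, even inside a large folio */
    for (i = 0; i < npage; i++)
            put_pfn(pfn + i, dma->prot);

    /* after: batch the unpin for the whole range, which also lets the
     * mm code handle large folios efficiently */
    if (!dma->has_rsvd)
            unpin_user_page_range_dirty_lock(pfn_to_page(pfn), npage,
                                             dma->prot & IOMMU_WRITE);
    else
            for (i = 0; i < npage; i++)
                    put_pfn(pfn + i, dma->prot);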
The performance test results for completing the 16G VFIO IOMMU DMA
unmapping are as follows.
Base(v6.16):
------- AVERAGE (MADV_HUGEPAGE) --------
VFIO UNMAP DMA in 0.141 s (113.7 GB/s)
------- AVERAGE (MAP_POPULATE) --------
VFIO UNMAP DMA in 0.307 s (52.2 GB/s)
------- AVERAGE (HUGETLBFS) --------
VFIO UNMAP DMA in 0.135 s (118.6 GB/s)
With this patchset:
------- AVERAGE (MADV_HUGEPAGE) --------
VFIO UNMAP DMA in 0.044 s (363.2 GB/s)
------- AVERAGE (MAP_POPULATE) --------
VFIO UNMAP DMA in 0.289 s (55.3 GB/s)
------- AVERAGE (HUGETLBFS) --------
VFIO UNMAP DMA in 0.044 s (361.3 GB/s)
For large folios, we achieve over a 67% performance improvement in
the VFIO UNMAP DMA item. For small folios, the performance test
results appear to show a slight improvement.
Suggested-by: Jason Gunthorpe <jgg@ziepe.ca>
Signed-off-by: Li Zhe <lizhe.67@bytedance.com>
Reviewed-by: David Hildenbrand <david@redhat.com>
Acked-by: David Hildenbrand <david@redhat.com>
Link: https://lore.kernel.org/r/20250814064714.56485-6-lizhe.67@bytedance.com
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
Introduce a new member has_rsvd for struct vfio_dma. This member is
used to indicate whether there are any reserved or invalid pfns in
the region represented by this vfio_dma. If it is true, it indicates
that there is at least one pfn in this region that is either reserved
or invalid.
Signed-off-by: Li Zhe <lizhe.67@bytedance.com>
Reviewed-by: David Hildenbrand <david@redhat.com>
Link: https://lore.kernel.org/r/20250814064714.56485-5-lizhe.67@bytedance.com
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
The function vpfn_pages() can help us determine the number of vpfn
nodes on the vpfn rb tree within a specified range. This allows us
to avoid searching for each vpfn individually in the function
vfio_unpin_pages_remote(). This patch batches the vfio_find_vpfn()
calls in function vfio_unpin_pages_remote().
Signed-off-by: Li Zhe <lizhe.67@bytedance.com>
Link: https://lore.kernel.org/r/20250814064714.56485-4-lizhe.67@bytedance.com
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
When vfio_pin_pages_remote() is called with a range of addresses that
includes large folios, the function currently performs individual
statistics counting operations for each page. This can lead to significant
performance overheads, especially when dealing with large ranges of pages.
Batch processing of statistical counting operations can effectively enhance
performance.
In addition, the pages obtained through longterm GUP are neither invalid
nor reserved. Therefore, we can reduce the overhead associated with some
calls to the function is_invalid_reserved_pfn().
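The accounting side of the batching looks roughly like this (a sketch;
lock_acct handling simplified):

    /* before: test every page, even though longterm-GUP pages are
     * never reserved or invalid */
    for (i = 0; i < pinned; i++)
            if (!is_invalid_reserved_pfn(pfn + i))
                    lock_acct++;

    /* after: the whole GUP batch is known-good, account it in one step */
    lock_acct += pinned;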
The performance test results for completing the 16G VFIO IOMMU DMA mapping
are as follows.
Base(v6.16):
------- AVERAGE (MADV_HUGEPAGE) --------
VFIO MAP DMA in 0.049 s (328.5 GB/s)
------- AVERAGE (MAP_POPULATE) --------
VFIO MAP DMA in 0.268 s (59.6 GB/s)
------- AVERAGE (HUGETLBFS) --------
VFIO MAP DMA in 0.051 s (310.9 GB/s)
With this patch:
------- AVERAGE (MADV_HUGEPAGE) --------
VFIO MAP DMA in 0.025 s (629.8 GB/s)
------- AVERAGE (MAP_POPULATE) --------
VFIO MAP DMA in 0.253 s (63.1 GB/s)
------- AVERAGE (HUGETLBFS) --------
VFIO MAP DMA in 0.030 s (530.5 GB/s)
For large folios, we achieve over a 40% performance improvement.
For small folios, the performance test results indicate a
slight improvement.
Signed-off-by: Li Zhe <lizhe.67@bytedance.com>
Co-developed-by: Alex Williamson <alex.williamson@redhat.com>
Acked-by: David Hildenbrand <david@redhat.com>
Tested-by: Eric Farman <farman@linux.ibm.com>
Link: https://lore.kernel.org/r/20250814064714.56485-3-lizhe.67@bytedance.com
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
Merge tag 'vfio-v6.18-rc1' of https://github.com/awilliam/linux-vfio
Pull VFIO updates from Alex Williamson:
- Use fdinfo to expose the sysfs path of a device represented by a vfio
device file (Alex Mastro)
- Mark vfio-fsl-mc, vfio-amba, and the reset functions for
vfio-platform for removal as these are either orphaned or believed to
be unused (Alex Williamson)
- Add reviewers for vfio-platform to save it from also being marked for
removal (Mostafa Saleh, Pranjal Shrivastava)
- VFIO selftests, including basic sanity testing and minimal userspace
drivers for testing against real hardware. This is also expected to
provide integration with KVM selftests for KVM-VFIO interfaces (David
Matlack, Josh Hilke)
- Fix drivers/cdx and vfio/cdx to build without CONFIG_GENERIC_MSI_IRQ
(Nipun Gupta)
- Fix reference leak in hisi_acc (Miaoqian Lin)
- Use consistent return for unsupported device feature (Alex Mastro)
- Unwind using the correct memory free callback in vfio/pds (Zilin
Guan)
- Use the IRQ_DISABLE_UNLAZY flag to improve handling of pre-PCI 2.3
INTx and resolve a stalled interrupt on ppc64 (Timothy Pearson)
- Enable GB300 in nvgrace-gpu vfio-pci variant driver (Tushar Dave)
- Misc:
- Drop unnecessary ternary conversion in vfio/pci (Xichao Zhao)
- Grammatical fix in nvgrace-gpu (Morduan Zang)
- Update Shameer's email address (Shameer Kolothum)
- Fix document build warning (Alex Williamson)
* tag 'vfio-v6.18-rc1' of https://github.com/awilliam/linux-vfio: (48 commits)
vfio/nvgrace-gpu: Add GB300 SKU to the devid table
vfio/pci: Fix INTx handling on legacy non-PCI 2.3 devices
vfio/pds: replace bitmap_free with vfree
vfio: return -ENOTTY for unsupported device feature
hisi_acc_vfio_pci: Fix reference leak in hisi_acc_vfio_debug_init
vfio/platform: Mark reset drivers for removal
vfio/amba: Mark for removal
MAINTAINERS: Add myself as VFIO-platform reviewer
MAINTAINERS: Add myself as VFIO-platform reviewer
docs: proc.rst: Fix VFIO Device title formatting
vfio: selftests: Fix .gitignore for already tracked files
vfio/cdx: update driver to build without CONFIG_GENERIC_MSI_IRQ
cdx: don't select CONFIG_GENERIC_MSI_IRQ
MAINTAINERS: Update Shameer Kolothum's email address
vfio: selftests: Add a script to help with running VFIO selftests
vfio: selftests: Make iommufd the default iommu_mode
vfio: selftests: Add iommufd mode
vfio: selftests: Add iommufd_compat_type1{,v2} modes
vfio: selftests: Add vfio_type1v2_mode
vfio: selftests: Replicate tests across all iommu_modes
...
PCI devices prior to PCI 2.3 both use level interrupts and do not support
interrupt masking, leading to a failure when passed through to a KVM guest on
at least the ppc64 platform. This failure manifests as receiving and
acknowledging a single interrupt in the guest, while the device continues to
assert the level interrupt indicating a need for further servicing.
When lazy IRQ masking is used on DisINTx- (non-PCI 2.3) hardware, the following
sequence occurs:
* Level IRQ assertion on device
* IRQ marked disabled in kernel
* Host interrupt handler exits without clearing the interrupt on the device
* Eventfd is delivered to userspace
* Guest processes IRQ and clears device interrupt
* Device de-asserts INTx, then re-asserts INTx while the interrupt is masked
* Newly asserted interrupt acknowledged by kernel VMM without being handled
* Software mask removed by VFIO driver
* Device INTx still asserted, host controller does not see new edge after EOI
The behavior is now platform-dependent. Some platforms (amd64) will continue
to spew IRQs for as long as the INTx line remains asserted, therefore the IRQ
will be handled by the host as soon as the mask is dropped. Others (ppc64) will
only send the one request, and if it is not handled no further interrupts will
be sent. The former behavior theoretically leaves the system vulnerable to
interrupt storm, and the latter will result in the device stalling after
receiving exactly one interrupt in the guest.
Work around this by disabling lazy IRQ masking for DisINTx- INTx devices.
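The opt-out uses the genirq status flag for this (flag name from
include/linux/irq.h; its exact placement in the INTx setup path is
assumed):

    if (!vdev->pci_2_3) {
            /* DisINTx- device: INTx cannot be masked at the device, so
             * make the irq core mask at the interrupt controller
             * immediately instead of lazily on the next interrupt */
            irq_set_status_flags(pdev->irq, IRQ_DISABLE_UNLAZY);
    }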
Signed-off-by: Timothy Pearson <tpearson@raptorengineering.com>
Link: https://lore.kernel.org/r/333803015.1744464.1758647073336.JavaMail.zimbra@raptorengineeringinc.com
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
It's no longer required to use nth_page() when iterating pages within a
single SG entry, so let's drop the nth_page() usage.
Link: https://lkml.kernel.org/r/20250901150359.867252-33-david@redhat.com
Signed-off-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Reviewed-by: Alex Williamson <alex.williamson@redhat.com>
Reviewed-by: Brett Creeley <brett.creeley@amd.com>
Cc: Jason Gunthorpe <jgg@ziepe.ca>
Cc: Yishai Hadas <yishaih@nvidia.com>
Cc: Shameer Kolothum <shameerali.kolothum.thodi@huawei.com>
Cc: Kevin Tian <kevin.tian@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
host_seq_bmp is allocated with vzalloc but is currently freed with
bitmap_free, which uses kfree internally. This mismatch prevents the
resource from being released properly and may result in memory leaks
or other issues.
Fix this by freeing host_seq_bmp with vfree to match the vzalloc
allocation.
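The rule the fix restores is that allocation and free must come from
the same allocator family (illustrative):

    /* allocation (dirty page tracking setup) */
    host_seq_bmp = vzalloc(size);

    /* before: bitmap_free() -> kfree(), which cannot free vmalloc
     * memory */
    bitmap_free(host_seq_bmp);

    /* after: vfree() pairs with vzalloc()/vmalloc() */
    vfree(host_seq_bmp);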
Fixes: f232836a91 ("vfio/pds: Add support for dirty page tracking")
Signed-off-by: Zilin Guan <zilin@seu.edu.cn>
Reviewed-by: Brett Creeley <brett.creeley@amd.com>
Link: https://lore.kernel.org/r/20250913153154.1028835-1-zilin@seu.edu.cn
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>