mirror of https://github.com/torvalds/linux.git
432 Commits
208eed95fc
soc: driver updates for 6.19
This is the first half of the driver changes:
- A treewide interface change to the "syscore" operations for
power management, as a preparation for future Tegra specific
changes.
- Reset controller updates with added drivers for LAN969x, eic7700
and RZ/G3S SoCs.
- Protection of system controller registers on Renesas and Google SoCs,
to prevent trivially triggering a system crash from e.g. debugfs
access.
- soc_device identification updates on Nvidia, Exynos and Mediatek
- debugfs support in the ST STM32 firewall driver
- Minor updates for SoC drivers on AMD/Xilinx, Renesas, Allwinner, TI
- Cleanups for memory controller support on Nvidia and Renesas
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEEo6/YBQwIrVS28WGKmmx57+YAGNkFAmky/8gACgkQmmx57+YA
GNlqohAApPTLM6Q4gf1cIcsTVaP0uxx9CBgupCGuT5ORrOMKBghVWjTOTSxeEAab
UQF465QwYUUu602GH34UmRaY9CKW2bMIsfmkgmxNB4Y4Qd7yCgQNJ/h/TnN0rBH+
qTeEsRH/hax4miSNsh0oOZfVkZkg+23VF02d1VL0CcaX7y4oT45RPBQugrNx/gNS
fHfVwgIq8vJ8WyrmM1h2nv1i1vgSzEy50B3kY674BBw83FcJTafNLvD7N5DSgD1H
/I/2xeyEpb+oL1VfeHcXZaX/jf04O+cmvSzBi+MOH1tI3MpdxJib1vEYBdggoOWN
K/FFGgsOY+DNmJPpSnPTTu8UpzksS8SxGBP7M9Q8roKZwA2c9wLotxySvjki5yv8
2zvabRdzbrSaoYwsH9QnZdQ2hVkJ9W8MESu8PevD3yMNuFUzledPDWW0N1SbGm78
0ZdB6NPdaBZYHMNMRdFhN8P275/Mx5e0XWN9oYMQqjPooH7YkyT7hJWz6ao2PCJP
8mDmnW1RzL+LWf7mJ25ZEtS+YjmKA/PVmogRrGurKCadvdxXqCF09KNljICHhmmu
t0KB4dqw02OXLPvBk21qCi0zL56w1JDgqtS8suFvDYo9sCceeAbAcmpyoUOFj2N+
Upn976tb4iqFrr9mFswpmCJWPpqJkU+A+KnKsIRPU7N4kSrP35I=
=HvlN
-----END PGP SIGNATURE-----
Merge tag 'soc-drivers-6.19' of git://git.kernel.org/pub/scm/linux/kernel/git/soc/soc
Pull SoC driver updates from Arnd Bergmann:
"This is the first half of the driver changes:
- A treewide interface change to the "syscore" operations for power
management, as a preparation for future Tegra specific changes
- Reset controller updates with added drivers for LAN969x, eic7700 and
RZ/G3S SoCs
- Protection of system controller registers on Renesas and Google
SoCs, to prevent trivially triggering a system crash from e.g.
debugfs access
- soc_device identification updates on Nvidia, Exynos and Mediatek
- debugfs support in the ST STM32 firewall driver
- Minor updates for SoC drivers on AMD/Xilinx, Renesas, Allwinner, TI
- Cleanups for memory controller support on Nvidia and Renesas"
* tag 'soc-drivers-6.19' of git://git.kernel.org/pub/scm/linux/kernel/git/soc/soc: (114 commits)
memory: tegra186-emc: Fix missing put_bpmp
Documentation: reset: Remove reset_controller_add_lookup()
reset: fix BIT macro reference
reset: rzg2l-usbphy-ctrl: Fix a NULL vs IS_ERR() bug in probe
reset: th1520: Support reset controllers in more subsystems
reset: th1520: Prepare for supporting multiple controllers
dt-bindings: reset: thead,th1520-reset: Add controllers for more subsys
dt-bindings: reset: thead,th1520-reset: Remove non-VO-subsystem resets
reset: remove legacy reset lookup code
clk: davinci: psc: drop unused reset lookup
reset: rzg2l-usbphy-ctrl: Add support for RZ/G3S SoC
reset: rzg2l-usbphy-ctrl: Add support for USB PWRRDY
dt-bindings: reset: renesas,rzg2l-usbphy-ctrl: Document RZ/G3S support
reset: eswin: Add eic7700 reset driver
dt-bindings: reset: eswin: Documentation for eic7700 SoC
reset: sparx5: add LAN969x support
dt-bindings: reset: microchip: Add LAN969x support
soc: rockchip: grf: Add select correct PWM implementation on RK3368
soc/tegra: pmc: Add USB wake events for Tegra234
amba: tegra-ahb: Fix device leak on SMMU enable
...
1eb0ae6fbd
iommupt/vtd: Support mgaw's less than a 4 level walk for first stage
If the IOVA is limited to fewer than 48 bits, the page table will be
constructed with a 3-level configuration, which is unsupported by the
hardware.
As with the second stage, the caller needs to pass in both the top_level and
the vasz to specify a table that has more levels than required to hold the
IOVA range.
Fixes:
d856f9d278
iommupt/vtd: Allow VT-d to have a larger table top than the vasz requires
VT-d second stage HW specifies both the maximum IOVA and the supported
table walk starting points. Weirdly there is HW that only supports a 4
level walk but has a maximum IOVA that only needs 3.
The current code miscalculates this and creates a wrongly sized page table
which ultimately fails the compatibility check for number of levels.
This is fixed by allowing the page table to be created with both a vasz
and top_level input. The vasz will set the aperture for the domain while
the top_level will set the page table geometry.
Add top_level to vtdss and correct the logic in VT-d to generate the right
top_level and vasz from mgaw and sagaw.
Fixes:
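A small compilable sketch of the geometry split described above (the arithmetic assumes 9 IOVA bits per table level and 4 KiB pages; the mgaw/top_level values are examples, not values read from hardware):

    #include <stdio.h>

    /* vasz sets the domain aperture while top_level sets the table depth;
     * as described above they may disagree, e.g. HW whose sagaw only
     * offers a 4-level walk even though mgaw needs just 3 levels. */
    static int levels_needed(unsigned vasz_bits)
    {
        return (vasz_bits - 12 + 8) / 9;    /* ceil((vasz - 12) / 9) */
    }

    int main(void)
    {
        unsigned mgaw = 39;    /* example max IOVA width: 3 levels suffice */
        int top_level = 4;     /* example walk depth the HW supports */
        printf("aperture needs %d levels, table built with %d\n",
               levels_needed(mgaw), top_level);
        return 0;
    }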
6cbc09b771
iommu/vt-d: Restore previous domain::aperture_end calculation
Commit
a97fbc3ee3
syscore: Pass context data to callbacks
Several drivers can benefit from registering per-instance data along with the syscore operations. To achieve this, move the modifiable fields out of the syscore_ops structure and into a separate struct syscore that can be registered with the framework. Add a void * driver data field for drivers to store contextual data that will be passed to the syscore ops.
Acked-by: Rafael J. Wysocki (Intel) <rafael@kernel.org>
Signed-off-by: Thierry Reding <treding@nvidia.com>
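A minimal user-space sketch of the described split, assuming illustrative names (this is not the upstream API):

    #include <stdio.h>

    /* Immutable callbacks stay in the ops structure. */
    struct syscore_ops {
        int (*suspend)(void *data);
    };

    /* Modifiable, per-instance state moves to struct syscore, including
     * the new void * driver data passed to the callbacks. */
    struct syscore {
        const struct syscore_ops *ops;
        void *data;
    };

    static int pmc_suspend(void *data)
    {
        printf("suspending %s\n", (const char *)data);
        return 0;
    }

    static const struct syscore_ops pmc_ops = { .suspend = pmc_suspend };

    int main(void)
    {
        struct syscore sc = { .ops = &pmc_ops, .data = "pmc@7000e400" };
        return sc.ops->suspend(sc.data);    /* context reaches the callback */
    }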
101a285411
iommu/vt-d: Follow PT_FEAT_DMA_INCOHERENT into the PASID entry
Currently an incoherent walk domain cannot be attached to a coherent capable iommu. Kevin says HW probably doesn't exist with such a mixture, but making the driver support it makes logical sense anyhow.
When building the PASID entry the PWSNP (Page Walk Snoop) bit tells the HW if it should issue snoops. If the page table is cache flushed because of PT_FEAT_DMA_INCOHERENT then it is fine to set this bit to 0 even if the HW supports 1.
Weaken the compatible check to permit a coherent instance to accept an incoherent table and fix the PASID table construction to set PWSNP from PT_FEAT_DMA_INCOHERENT. SVA always sets PWSNP.
Reviewed-by: Lu Baolu <baolu.lu@linux.intel.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
d373449d8e
iommu/vt-d: Use the generic iommu page table
Replace the VT-d iommu_domain implementation of the VT-d second stage and
first stage page tables with the iommupt VTDSS and x86_64
pagetables. x86_64 is shared with the AMD driver.
There are a couple of notable things in VT-d:
- Like AMD the second stage format is not sign extended, unlike AMD it
cannot decode a full 64 bits. The first stage format is a normal sign
extended x86 page table
- The HW caps can indicate how many levels, how many address bits and what
leaf page sizes are supported in HW. As before the highest number of
levels that can translate the entire supported address width is used.
The supported page sizes are adjusted directly from the dedicated
first/second stage cap bits.
- VTD requires flushing 'write buffers'. This logic is left unchanged;
the write buffer flushes on any gather flush or through iotlb_sync_map.
- Like ARM, VTD has an optional non-coherent page table walker that
requires cache flushing. This is supported through PT_FEAT_DMA_INCOHERENT,
the same as on ARM; however, x86 can't use the DMA API for flushing and
must call the arch function clflush_cache_range()
- The PT_FEAT_DYNAMIC_TOP can probably be supported on VT-d someday for the
second stage when it uses 128 bit atomic stores for the HW context
structures.
- PT_FEAT_VTDSS_FORCE_WRITEABLE is used to work around ERRATA_772415_SPR17
- A kernel command line parameter "sp_off" disables all page sizes except
4k
Remove all the unused iommu_domain page table code. The debugfs paths have
their own independent page table walker that is left alone for now.
This corrects a race with the non-coherent walker that the ARM
implementations have fixed:
   CPU 0                                CPU 1
 pfn_to_dma_pte()                     pfn_to_dma_pte()
   pte = &parent[offset];
   if (!dma_pte_present(pte)) {
     try_cmpxchg64(&pte->val)
                                        pte = &parent[offset];
                                        .. dma_pte_present(pte) ..
                                        [...]
                                        // iommu_map() completes
                                        // Device does DMA
   domain_flush_cache(pte)
The CPU 1 mapping operation shares a page table level with the CPU 0
mapping operation. CPU 0 installed a new page table level but has not
flushed it yet. CPU1 returns from iommu_map() and the device does DMA. The
non coherent walker fails to see the new table level installed by CPU 0
and fails the DMA with non-present.
The iommupt PT_FEAT_DMA_INCOHERENT implementation uses the ARM design of
storing a flag when CPU 0 completes the flush. If the flag is not set CPU
1 will also flush to ensure the HW can fully walk to the PTE being
installed.
Cc: Tina Zhang <tina.zhang@intel.com>
Reviewed-by: Lu Baolu <baolu.lu@linux.intel.com>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
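A compilable sketch of the flush-completed flag described above (the per-entry layout and names are illustrative; upstream tracks this differently):

    #include <stdatomic.h>
    #include <stdbool.h>

    struct pt_entry {
        _Atomic unsigned long long val;
        _Atomic bool flushed;    /* set once the installer's flush is done */
    };

    static void cache_flush(struct pt_entry *pte) { (void)pte; /* clflush */ }

    /* Installer (CPU 0): publish the new level, then flush and mark it. */
    static void install_level(struct pt_entry *pte, unsigned long long newval)
    {
        unsigned long long expected = 0;
        if (atomic_compare_exchange_strong(&pte->val, &expected, newval)) {
            cache_flush(pte);
            atomic_store(&pte->flushed, true);
        }
    }

    /* Sharer (CPU 1): if the installer has not flushed yet, flush here
     * too, so a non-coherent HW walker can always reach the new level. */
    static void share_level(struct pt_entry *pte)
    {
        if (!atomic_load(&pte->flushed))
            cache_flush(pte);
    }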
fd714986e4
iommu: Pass in old domain to attach_dev callback functions
The IOMMU core attaches each device to a default domain on probe(). Then, every new "attach" operation fundamentally means two things:
- detach from its currently attached (old) domain
- attach to a given new domain
Modern IOMMU drivers following this pattern usually want to clean up the things related to the old domain, so they call iommu_get_domain_for_dev() to fetch the old domain. Pass in the old domain pointer from the core to drivers, aligning with the set_dev_pasid op that does so already. Ensure all low-level attach functions in the core can forward the correct old domain pointer. Thus, rework those functions as well.
Suggested-by: Jason Gunthorpe <jgg@nvidia.com>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Signed-off-by: Nicolin Chen <nicolinc@nvidia.com>
Reviewed-by: Lu Baolu <baolu.lu@linux.intel.com>
Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
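A sketch of the reworked callback shape described above (declarations only; the exact upstream signature may differ):

    struct iommu_domain;
    struct device;

    struct iommu_ops_sketch {
        /* The old domain is now passed in by the core, so drivers no
         * longer need to fetch it with iommu_get_domain_for_dev(). */
        int (*attach_dev)(struct iommu_domain *new_domain,
                          struct device *dev,
                          struct iommu_domain *old_domain);
    };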
5f4b8c03f4
Merge branches 'apple/dart', 'ti/omap', 'riscv', 'intel/vt-d' and 'amd/amd-vi' into next
5ef7e24c74
iommu/vt-d: PRS isn't usable if PDS isn't supported
The specification, Section 7.10, "Software Steps to Drain Page Requests &
Responses," requires software to submit an Invalidation Wait Descriptor
(inv_wait_dsc) with the Page-request Drain (PD=1) flag set, along with
the Invalidation Wait Completion Status Write flag (SW=1). It then waits
for the Invalidation Wait Descriptor's completion.
However, the PD field in the Invalidation Wait Descriptor is optional, as
stated in Section 6.5.2.9, "Invalidation Wait Descriptor":
"Page-request Drain (PD): Remapping hardware implementations reporting
Page-request draining as not supported (PDS = 0 in ECAP_REG) treat this
field as reserved."
This implies that if the IOMMU doesn't support the PDS capability, software
can't drain page requests and group responses as expected.
Do not enable PCI/PRI if the IOMMU doesn't support PDS.
Reported-by: Joel Granados <joel.granados@kernel.org>
Closes: https://lore.kernel.org/r/20250909-jag-pds-v1-1-ad8cba0e494e@kernel.org
Fixes:
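A sketch of the resulting gate (the flag bit positions below are illustrative assumptions, not the spec encoding):

    #include <stdbool.h>
    #include <stdint.h>

    #define INV_WAIT_PD    (1u << 0)    /* Page-request Drain (assumed) */
    #define INV_WAIT_SW    (1u << 1)    /* Status Write (assumed) */

    static void submit_inv_wait(uint32_t flags) { (void)flags; }

    static bool pri_enable(bool pds_supported)
    {
        if (!pds_supported)
            return false;    /* PD is reserved without PDS: PRS unusable */

        /* Section 7.10 drain: inv_wait_dsc with PD=1 and SW=1, then wait
         * for its completion. */
        submit_inv_wait(INV_WAIT_PD | INV_WAIT_SW);
        return true;
    }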
dce043c07c
iommu/vt-d: Fix __domain_mapping()'s usage of switch_to_super_page()
switch_to_super_page() assumes the memory range it's working on is aligned
to the target large page level. Unfortunately, __domain_mapping() doesn't
take this into account when using it, and will pass unaligned ranges
ultimately freeing a PTE range larger than expected.
Take for example a mapping with the following iov_pfn range [0x3fe400,
0x4c0600), which should be backed by the following mappings:
iov_pfn [0x3fe400, 0x3fffff] covered by 2MiB pages
iov_pfn [0x400000, 0x4bffff] covered by 1GiB pages
iov_pfn [0x4c0000, 0x4c05ff] covered by 2MiB pages
Under this circumstance, __domain_mapping() will pass [0x400000, 0x4c05ff]
to switch_to_super_page() at a 1 GiB granularity, which will in turn
free PTEs all the way to iov_pfn 0x4fffff.
Mitigate this by rounding down the iov_pfn range passed to
switch_to_super_page() in __domain_mapping()
to the target large page level.
Additionally add range alignment checks to switch_to_super_page.
Fixes:
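A compilable sketch of the round-down using the example range above (9 IOVA bits per level, 4 KiB pages; helper names are illustrative):

    #include <stdint.h>
    #include <stdio.h>

    #define LEVEL_SHIFT(l)    (((l) - 1) * 9)    /* pfn bits per level */

    static uint64_t align_down_to_level(uint64_t iov_pfn, int level)
    {
        uint64_t mask = (1ull << LEVEL_SHIFT(level)) - 1;
        return iov_pfn & ~mask;    /* start of the enclosing large page */
    }

    int main(void)
    {
        /* End of the example range [0x3fe400, 0x4c0600) at the 1 GiB
         * level (level 3). Clamping keeps the 1 GiB span at
         * [0x400000, 0x4bffff] instead of spilling PTEs to 0x4fffff. */
        uint64_t end = 0x4c0600;
        printf("clamped end: %#llx\n",
               (unsigned long long)(align_down_to_level(end, 3) - 1));
        return 0;
    }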
0bd0a41a51
pci-v6.17-changes
-----BEGIN PGP SIGNATURE-----
iQJIBAABCgAyFiEEgMe7l+5h9hnxdsnuWYigwDrT+vwFAmiL3OkUHGJoZWxnYWFz
QGdvb2dsZS5jb20ACgkQWYigwDrT+vz9bhAAqiD9REYlNUgGX/bEBgCVPFdtjjTz
FpSLzG23vWd2J0FEy04qtQWH9j71IXnM+yMybzsMe9SsPt2HhczzSCIMpPj0FZNN
ccOf3gA/KqPux7FORrS3mpM8OO4ICt3XZhCji3nNg5iW5XlH+NrQKPVxRlvBB0rP
+7RxSjDClUdZ97QSSmp1uZ7Qh1qyV0Ht0qjPMwecrnB2kApt4ZaMphAaKPEjX/4f
RgZPFqbIpRWt9e87Z8ADr5c2jokZAzIV0zauQ2fhbjBkTcXIXL3yOzUbR+ngBWDD
oq21rXJBUCQheA7J6j2SKabgF9AZaI5NI9ERld5vJ1inXSZCyuyKopN1AzuKZquG
N+jyYJqZC99ePvMLbTWs/spU58J03A6TOwaJNE3ISRgbnxFkhvLl7h68XuTDonZm
hYGloXXUj+i+rh7/eJIDDWa9MTpEvl2p1zc6EDIZ/umlnHwg9rGlGQVARMCs6Ist
EiJQEtjMMlXiBJMkFhpxesOdyonGkxAL9WtT6MoEOFF7dqgsTqSKiDUPa+6MHV+I
tsTB630J3ROsWGfQD1uJI2BrCm+op4j6faamH6UMqCrUU0TUZMHiRR3qVWbM6qgU
/WL1gZ96uy5I7UoE0+gH+wMhMClO2BnsxffocToDE5wOYpGDd5BwPEoY8ej8U2lu
CBMCkMor1jDtS8Y=
=ipv3
-----END PGP SIGNATURE-----
Merge tag 'pci-v6.17-changes' of git://git.kernel.org/pub/scm/linux/kernel/git/pci/pci
Pull PCI updates from Bjorn Helgaas:
"Enumeration:
- Allow built-in drivers, not just modular drivers, to use async
initial probing (Lukas Wunner)
- Support Immediate Readiness even on devices with no PM Capability
(Sean Christopherson)
- Consolidate definition of PCIE_RESET_CONFIG_WAIT_MS (100ms), the
required delay between a reset and sending config requests to a
device (Niklas Cassel)
- Add pci_is_display() to check for "Display" base class and use it
in ALSA hda, vfio, vga_switcheroo, vt-d (Mario Limonciello)
- Allow 'isolated PCI functions' (multi-function devices without a
function 0) for LoongArch, similar to s390 and jailhouse (Huacai
Chen)
Power control:
- Add ability to enable optional slot clock for cases where the PCIe
host controller and the slot are supplied by different clocks
(Marek Vasut)
PCIe native device hotplug:
- Fix runtime PM ref imbalance on Hot-Plug Capable ports caused by
misinterpreting a config read failure after a device has been
removed (Lukas Wunner)
- Avoid creating a useless PCIe port service device for pciehp if the
slot is handled by the ACPI hotplug driver (Lukas Wunner)
- Ignore ACPI hotplug slots when calculating depth of pciehp hotplug
ports (Lukas Wunner)
Virtualization:
- Save VF resizable BAR state and restore it after reset (Michał
Winiarski)
- Allow IOV resources (VF BARs) to be resized (Michał Winiarski)
- Add pci_iov_vf_bar_set_size() so drivers can control VF BAR size
(Michał Winiarski)
Endpoint framework:
- Add RC-to-EP doorbell support using platform MSI controller,
including a test case (Frank Li)
- Allow BAR assignment via configfs so platforms have flexibility in
determining BAR usage (Jerome Brunet)
Native PCIe controller drivers:
- Convert amazon,al-alpine-v[23]-pcie, apm,xgene-pcie,
axis,artpec6-pcie, marvell,armada-3700-pcie, st,spear1340-pcie to
DT schema format (Rob Herring)
- Use dev_fwnode() instead of of_fwnode_handle() to remove OF
dependency in altera (fixes an unused variable), designware-host,
mediatek, mediatek-gen3, mobiveil, plda, xilinx, xilinx-dma,
xilinx-nwl (Jiri Slaby, Arnd Bergmann)
- Convert aardvark, altera, brcmstb, designware-host, iproc,
mediatek, mediatek-gen3, mobiveil, plda, rcar-host, vmd, xilinx,
xilinx-dma, xilinx-nwl from using pci_msi_create_irq_domain() to
using msi_create_parent_irq_domain() instead; this makes the
interrupt controller per-PCI device, allows dynamic allocation of
vectors after initialization, and allows support of IMS (Nam Cao)
APM X-Gene PCIe controller driver:
- Rewrite MSI handling to MSI CPU affinity, drop useless CPU hotplug
bits, use device-managed memory allocations, and clean things up
(Marc Zyngier)
- Probe xgene-msi as a standard platform driver rather than a
subsys_initcall (Marc Zyngier)
Broadcom STB PCIe controller driver:
- Add optional DT 'num-lanes' property and if present, use it to
override the Maximum Link Width advertised in Link Capabilities
(Jim Quinlan)
Cadence PCIe controller driver:
- Use PCIe Message routing types from the PCI core rather than
defining private ones (Hans Zhang)
Freescale i.MX6 PCIe controller driver:
- Add IMX8MQ_EP third 64-bit BAR in epc_features (Richard Zhu)
- Add IMX8MM_EP and IMX8MP_EP fixed 256-byte BAR 4 in epc_features
(Richard Zhu)
- Configure LUT for MSI/IOMMU in Endpoint mode so Root Complex can
trigger doorbell on Endpoint (Frank Li)
- Remove apps_reset (LTSSM_EN) from
imx_pcie_{assert,deassert}_core_reset(), which fixes a hotplug
regression on i.MX8MM (Richard Zhu)
- Delay Endpoint link start until configfs 'start' written (Richard
Zhu)
Intel VMD host bridge driver:
- Add Intel Panther Lake (PTL)-H/P/U Vendor ID (George D Sworo)
Qualcomm PCIe controller driver:
- Add DT binding and driver support for SA8255p, which supports ECAM
for Configuration Space access (Mayank Rana)
- Update DT binding and driver to describe PHYs and per-Root Port
resets in a Root Port stanza and deprecate describing them in the
host bridge; this makes it possible to support multiple Root Ports
in the future (Krishna Chaitanya Chundru)
- Add Qualcomm QCS615 to SM8150 DT binding (Ziyue Zhang)
- Add Qualcomm QCS8300 to SA8775p DT binding (Ziyue Zhang)
- Drop TBU and ref clocks from Qualcomm SM8150 and SC8180x DT
bindings (Konrad Dybcio)
- Document 'link_down' reset in Qualcomm SA8775P DT binding (Ziyue
Zhang)
- Add required PCIE_RESET_CONFIG_WAIT_MS delay after Link up IRQ
(Niklas Cassel)
Rockchip PCIe controller driver:
- Drop unused PCIe Message routing and code definitions (Hans Zhang)
- Remove several unused header includes (Hans Zhang)
- Use standard PCIe config register definitions instead of
rockchip-specific redefinitions (Geraldo Nascimento)
- Set Target Link Speed to 5.0 GT/s before retraining so we have a
chance to train at a higher speed (Geraldo Nascimento)
Rockchip DesignWare PCIe controller driver:
- Prevent race between link training and register update via DBI by
inhibiting link training after hot reset and link down (Wilfred
Mallawa)
- Add required PCIE_RESET_CONFIG_WAIT_MS delay after Link up IRQ
(Niklas Cassel)
Sophgo PCIe controller driver:
- Add DT binding and driver for Sophgo SG2044 PCIe controller driver
in Root Complex mode (Inochi Amaoto)
Synopsys DesignWare PCIe controller driver:
- Add required PCIE_RESET_CONFIG_WAIT_MS after waiting for Link up on
Ports that support > 5.0 GT/s. Slower Ports still rely on the
not-quite-correct PCIE_LINK_WAIT_SLEEP_MS 90ms default delay while
waiting for the Link (Niklas Cassel)"
* tag 'pci-v6.17-changes' of git://git.kernel.org/pub/scm/linux/kernel/git/pci/pci: (116 commits)
dt-bindings: PCI: qcom,pcie-sa8775p: Document 'link_down' reset
dt-bindings: PCI: Remove 83xx-512x-pci.txt
dt-bindings: PCI: Convert amazon,al-alpine-v[23]-pcie to DT schema
dt-bindings: PCI: Convert marvell,armada-3700-pcie to DT schema
dt-bindings: PCI: Convert apm,xgene-pcie to DT schema
dt-bindings: PCI: Convert axis,artpec6-pcie to DT schema
dt-bindings: PCI: Convert st,spear1340-pcie to DT schema
PCI: Move is_pciehp check out of pciehp_is_native()
PCI: pciehp: Use is_pciehp instead of is_hotplug_bridge
PCI/portdrv: Use is_pciehp instead of is_hotplug_bridge
PCI/ACPI: Fix runtime PM ref imbalance on Hot-Plug Capable ports
selftests: pci_endpoint: Add doorbell test case
misc: pci_endpoint_test: Add doorbell test case
PCI: endpoint: pci-epf-test: Add doorbell test support
PCI: endpoint: Add pci_epf_align_inbound_addr() helper for inbound address alignment
PCI: endpoint: pci-ep-msi: Add checks for MSI parent and mutability
PCI: endpoint: Add RC-to-EP doorbell support using platform MSI controller
PCI: dwc: Add Sophgo SG2044 PCIe controller driver in Root Complex mode
PCI: vmd: Switch to msi_create_parent_irq_domain()
PCI: vmd: Convert to lock guards
...
c93529ad4f
iommufd 6.17 merge window pull
- IOMMU HW now has features to directly assign HW command queues to a
guest VM. In this mode the command queue operates on a limited set of
invalidation commands that are suitable for improving guest invalidation
performance and easy for the HW to virtualize.
This PR brings the generic infrastructure to allow IOMMU drivers to
expose such command queues through the iommufd uAPI, mmap the doorbell
pages, and get the guest physical range for the command queue ring
itself.
- An implementation for the NVIDIA SMMUv3 extension "cmdqv" is built on
the new iommufd command queue features. It works with the existing SMMU
driver support for cmdqv in guest VMs.
- Many precursor cleanups and improvements to support the above cleanly,
changes to the general ioctl and object helpers, driver support for
VDEVICE, and mmap pgoff cookie infrastructure.
- Sequence VDEVICE destruction to always happen before VFIO device
destruction. When using the above type features, and also in future
confidential compute, the internal virtual device representation becomes
linked to HW or CC TSM configuration and objects. If a VFIO device is
removed from iommufd those HW objects should also be cleaned up to
prevent a sort of UAF. This became important now that we have HW backing
the VDEVICE.
- Fix one syzkaller found error related to math overflows during iova
allocation
-----BEGIN PGP SIGNATURE-----
iHUEABYKAB0WIQRRRCHOFoQz/8F5bUaFwuHvBreFYQUCaIpl9AAKCRCFwuHvBreF
YS5tAP9MDIRML5a/2IOhzcsc4LiDkWTMKm2m1wcRYd+iU2aFVQEAjdghINLHrUlx
HVuIDvNvWIUED/oTAp5kCxQ7PBFN4gU=
=NmCO
-----END PGP SIGNATURE-----
Merge tag 'for-linus-iommufd' of git://git.kernel.org/pub/scm/linux/kernel/git/jgg/iommufd
Pull iommufd updates from Jason Gunthorpe:
"This broadly brings the assigned HW command queue support to iommufd.
This feature is used to improve SVA performance in VMs by avoiding
paravirtualization traps during SVA invalidations.
Along the way I think some of the core logic is in a much better state
to support future driver backed features.
Summary:
- IOMMU HW now has features to directly assign HW command queues to a
guest VM. In this mode the command queue operates on a limited set
of invalidation commands that are suitable for improving guest
invalidation performance and easy for the HW to virtualize.
This brings the generic infrastructure to allow IOMMU drivers to
expose such command queues through the iommufd uAPI, mmap the
doorbell pages, and get the guest physical range for the command
queue ring itself.
- An implementation for the NVIDIA SMMUv3 extension "cmdqv" is built
on the new iommufd command queue features. It works with the
existing SMMU driver support for cmdqv in guest VMs.
- Many precursor cleanups and improvements to support the above
cleanly, changes to the general ioctl and object helpers, driver
support for VDEVICE, and mmap pgoff cookie infrastructure.
- Sequence VDEVICE destruction to always happen before VFIO device
destruction. When using the above type features, and also in future
confidential compute, the internal virtual device representation
becomes linked to HW or CC TSM configuration and objects. If a VFIO
device is removed from iommufd those HW objects should also be
cleaned up to prevent a sort of UAF. This became important now that
we have HW backing the VDEVICE.
- Fix one syzkaller found error related to math overflows during iova
allocation"
* tag 'for-linus-iommufd' of git://git.kernel.org/pub/scm/linux/kernel/git/jgg/iommufd: (57 commits)
iommu/arm-smmu-v3: Replace vsmmu_size/type with get_viommu_size
iommu/arm-smmu-v3: Do not bother impl_ops if IOMMU_VIOMMU_TYPE_ARM_SMMUV3
iommufd: Rename some shortterm-related identifiers
iommufd/selftest: Add coverage for vdevice tombstone
iommufd/selftest: Explicitly skip tests for inapplicable variant
iommufd/vdevice: Remove struct device reference from struct vdevice
iommufd: Destroy vdevice on idevice destroy
iommufd: Add a pre_destroy() op for objects
iommufd: Add iommufd_object_tombstone_user() helper
iommufd/viommu: Roll back to use iommufd_object_alloc() for vdevice
iommufd/selftest: Test reserved regions near ULONG_MAX
iommufd: Prevent ALIGN() overflow
iommu/tegra241-cmdqv: import IOMMUFD module namespace
iommufd: Do not allow _iommufd_object_alloc_ucmd if abort op is set
iommu/tegra241-cmdqv: Add IOMMU_VEVENTQ_TYPE_TEGRA241_CMDQV support
iommu/tegra241-cmdqv: Add user-space use support
iommu/tegra241-cmdqv: Do not statically map LVCMDQs
iommu/tegra241-cmdqv: Simplify deinit flow in tegra241_cmdqv_remove_vintf()
iommu/tegra241-cmdqv: Use request_threaded_irq
iommu/arm-smmu-v3-iommufd: Add hw_info to impl_ops
...
53564f4005
IOMMU Updates for Linux v6.17:
- Core
* Remove the 'pgsize_bitmap' member from 'struct iommu_ops'
* Convert the x86 drivers over to msi_create_parent_irq_domain()
- AMD-Vi
* Add support for examining driver/device internals via debugfs
* Add support for "HATDis" to disable host translation when it is not
supported
* Add support for limiting the maximum host translation level based on
EFR[HATS]
- Apple DART
* Don't enable as built-in by default when ARCH_APPLE is selected
- Arm SMMU
* Devicetree bindings update for the Qualcomm SMMU in the "Milos" SoC
* Support for Qualcomm SM6115 MDSS parts
* Disable PRR on Qualcomm SM8250 as using these bits causes the
hypervisor to explode
- Intel VT-d
* Reorganize Intel VT-d to be ready for iommupt
* Optimize iotlb_sync_map for non-caching/non-RWBF modes
* Fix missed PASID in dev TLB invalidation in cache_tag_flush_all()
- Mediatek
* Fix build warnings when W=1
- Samsung Exynos
* Add support for reserved memory regions specified by the bootloader
- TI OMAP
* Use syscon_regmap_lookup_by_phandle_args() instead of parsing the
node manually
- Misc
* Cleanups and minor fixes across the board
-----BEGIN PGP SIGNATURE-----
iQFEBAABCgAuFiEEPxTL6PPUbjXGY88ct6xw3ITBYzQFAmiDhZ0QHHdpbGxAa2Vy
bmVsLm9yZwAKCRC3rHDchMFjNBZDB/0Q6p+Hxev6xpokVOTpKlyITA5JmhsXkvCn
4sOeuiXA9rXSKWOy5d8hCMLuMvZcEWgKe06B+Y4wBp5tsCk6eSepxfEtUV6IOS/a
Z/LNxELyCPpVp9piEOMaXWVFwJy7rfd15nQh5obupbqTvbjEVspP/vMAKoE0Db9R
6JEcDhA5Za4q1Vnk91lJMn5CC1Ih1sTUZ8ie8VOZCerUsQs5e8+GkTk5zwJQAMqn
W9I9QC4Fcs2zBJzl+Q7G6WNPraQxrp+9BGtoq2Mk5RDIO/iMUcl2cxIkqYCq6nSt
XgmuwKnUl7RH9HELIfbIi1LUwEbwUcYlGpVioQbhyPdNLzd51KSl
=5Dcj
-----END PGP SIGNATURE-----
Merge tag 'iommu-updates-v6.17' of git://git.kernel.org/pub/scm/linux/kernel/git/iommu/linux
Pull iommu updates from Will Deacon:
"Core:
- Remove the 'pgsize_bitmap' member from 'struct iommu_ops'
- Convert the x86 drivers over to msi_create_parent_irq_domain()
AMD-Vi:
- Add support for examining driver/device internals via debugfs
- Add support for "HATDis" to disable host translation when it is not
supported
- Add support for limiting the maximum host translation level based
on EFR[HATS]
Apple DART:
- Don't enable as built-in by default when ARCH_APPLE is selected
Arm SMMU:
- Devicetree bindings update for the Qualcomm SMMU in the "Milos" SoC
- Support for Qualcomm SM6115 MDSS parts
- Disable PRR on Qualcomm SM8250 as using these bits causes the
hypervisor to explode
Intel VT-d:
- Reorganize Intel VT-d to be ready for iommupt
- Optimize iotlb_sync_map for non-caching/non-RWBF modes
- Fix missed PASID in dev TLB invalidation in cache_tag_flush_all()
Mediatek:
- Fix build warnings when W=1
Samsung Exynos:
- Add support for reserved memory regions specified by the bootloader
TI OMAP:
- Use syscon_regmap_lookup_by_phandle_args() instead of parsing the
node manually
Misc:
- Cleanups and minor fixes across the board"
* tag 'iommu-updates-v6.17' of git://git.kernel.org/pub/scm/linux/kernel/git/iommu/linux: (48 commits)
iommu/vt-d: Fix UAF on sva unbind with pending IOPFs
iommu/vt-d: Make iotlb_sync_map a static property of dmar_domain
dt-bindings: arm-smmu: Remove sdm845-cheza specific entry
iommu/amd: Fix geometry.aperture_end for V2 tables
iommu/amd: Wrap debugfs ABI testing symbols snippets in literal code blocks
iommu/amd: Add documentation for AMD IOMMU debugfs support
iommu/amd: Add debugfs support to dump IRT Table
iommu/amd: Add debugfs support to dump device table
iommu/amd: Add support for device id user input
iommu/amd: Add debugfs support to dump IOMMU command buffer
iommu/amd: Add debugfs support to dump IOMMU Capability registers
iommu/amd: Add debugfs support to dump IOMMU MMIO registers
iommu/amd: Refactor AMD IOMMU debugfs initial setup
dt-bindings: arm-smmu: document the support on Milos
iommu/exynos: add support for reserved regions
iommu/arm-smmu: disable PRR on SM8250
iommu/arm-smmu-v3: Revert vmaster in the error path
iommu/io-pgtable-arm: Remove unused macro iopte_prot
iommu/arm-smmu-qcom: Add SM6115 MDSS compatible
iommu/qcom: Fix pgsize_bitmap
...
9f341a2aeb
Merge branch 'intel/vt-d' into next
* intel/vt-d:
  iommu/vt-d: Fix UAF on sva unbind with pending IOPFs
  iommu/vt-d: Make iotlb_sync_map a static property of dmar_domain
  iommu/vt-d: Deduplicate cache_tag_flush_all by reusing flush_range
  iommu/vt-d: Fix missing PASID in dev TLB flush with cache_tag_flush_all
  iommu/vt-d: Split paging_domain_compatible()
  iommu/vt-d: Split intel_iommu_enforce_cache_coherency()
  iommu/vt-d: Create unique domain ops for each stage
  iommu/vt-d: Split intel_iommu_domain_alloc_paging_flags()
  iommu/vt-d: Do not wipe out the page table NID when devices detach
  iommu/vt-d: Fold domain_exit() into intel_iommu_domain_free()
  iommu/vt-d: Lift the __pa to domain_setup_first_level/intel_svm_set_dev_pasid()
  iommu/vt-d: Optimize iotlb_sync_map for non-caching/non-RWBF modes
  iommu/vt-d: Remove the CONFIG_X86 wrapping from iommu init hook
f0b9d31c6e
iommu/vt-d: Fix UAF on sva unbind with pending IOPFs
Commit
cee686775f
iommu/vt-d: Make iotlb_sync_map a static property of dmar_domain
Commit
75952c4975
iommu/vt-d: Use pci_is_display()
The inline pci_is_display() helper does the same thing. Use it.
Suggested-by: Bjorn Helgaas <bhelgaas@google.com>
Signed-off-by: Mario Limonciello <mario.limonciello@amd.com>
Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
Reviewed-by: Lu Baolu <baolu.lu@linux.intel.com>
Reviewed-by: Daniel Dadap <ddadap@nvidia.com>
Reviewed-by: Simona Vetter <simona.vetter@ffwll.ch>
Link: https://patch.msgid.link/20250717173812.3633478-5-superm1@kernel.org
85cfaacc99
iommu/vt-d: Split paging_domain_compatible()
Make First/Second stage specific functions that follow the same pattern in intel_iommu_domain_alloc_first/second_stage() for computing EOPNOTSUPP. This makes the code easier to understand as if we couldn't create a domain with the parameters for this IOMMU instance then we certainly are not compatible with it.
Check superpage support directly against the per-stage cap bits and the pgsize_bitmap.
Add a note that the force_snooping is read without locking. The locking needs to cover the compatible check and the add of the device to the list.
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Link: https://lore.kernel.org/r/7-v3-dbbe6f7e7ae3+124ffe-vtd_prep_jgg@nvidia.com
Signed-off-by: Lu Baolu <baolu.lu@linux.intel.com>
Link: https://lore.kernel.org/r/20250714045028.958850-10-baolu.lu@linux.intel.com
Signed-off-by: Will Deacon <will@kernel.org>
0fa6f08934
iommu/vt-d: Split intel_iommu_enforce_cache_coherency()
First Stage and Second Stage have very different ways to deny no-snoop. The first stage uses the PGSNP bit which is global per-PASID so enabling requires loading new PASID entries for all the attached devices. Second stage uses a bit per PTE, so enabling just requires telling future maps to set the bit.
Since we now have two domain ops we can have two functions that can directly code their required actions instead of a bunch of logic dancing around use_first_level. Combine domain_set_force_snooping() into the new functions since they are the only caller.
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Link: https://lore.kernel.org/r/6-v3-dbbe6f7e7ae3+124ffe-vtd_prep_jgg@nvidia.com
Signed-off-by: Lu Baolu <baolu.lu@linux.intel.com>
Link: https://lore.kernel.org/r/20250714045028.958850-9-baolu.lu@linux.intel.com
Signed-off-by: Will Deacon <will@kernel.org>
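A rough sketch contrasting the two per-stage behaviours described above (structures and names are illustrative):

    #include <stdbool.h>

    struct pasid_dev { bool pgsnp; };

    /* First stage: PGSNP is global per PASID entry, so enforcing
     * coherency means reloading the PASID entry of every attached
     * device. */
    struct fs_domain { struct pasid_dev *devs; int ndevs; };

    static void fs_enforce_cache_coherency(struct fs_domain *d)
    {
        for (int i = 0; i < d->ndevs; i++)
            d->devs[i].pgsnp = true;    /* stands in for a PASID reload */
    }

    /* Second stage: snoop control is a bit in each PTE, so it is enough
     * to tell future maps to set it. */
    struct ss_domain { bool set_pte_snp; };

    static void ss_enforce_cache_coherency(struct ss_domain *d)
    {
        d->set_pte_snp = true;    /* future PTEs get the snoop bit */
    }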
b33125296b
iommu/vt-d: Create unique domain ops for each stage
Use the domain ops pointer to tell what kind of domain it is instead of the internal use_first_level indication. This also protects against wrongly using a SVA/nested/IDENTITY/BLOCKED domain type in places they should not be.
The only remaining uses of use_first_level outside the paging domain are in paging_domain_compatible() and intel_iommu_enforce_cache_coherency(). Thus, remove the useless sets of use_first_level in intel_svm_domain_alloc() and intel_iommu_domain_alloc_nested(). None of the unique ops for these domain types ever reference it on their call chains.
Add a WARN_ON() check in domain_context_mapping_one() as it only works with second stage.
This is preparation for iommupt which will have different ops for each of the stages.
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Link: https://lore.kernel.org/r/5-v3-dbbe6f7e7ae3+124ffe-vtd_prep_jgg@nvidia.com
Signed-off-by: Lu Baolu <baolu.lu@linux.intel.com>
Link: https://lore.kernel.org/r/20250714045028.958850-8-baolu.lu@linux.intel.com
Signed-off-by: Will Deacon <will@kernel.org>
b9434ba97c
iommu/vt-d: Split intel_iommu_domain_alloc_paging_flags()
Create stage specific functions that check the stage specific conditions if each stage can be supported. Have intel_iommu_domain_alloc_paging_flags() call both stages in sequence until one does not return EOPNOTSUPP and prefer to use the first stage if available and suitable for the requested flags.
Move second stage only operations like nested_parent and dirty_tracking into the second stage function for clarity. Move initialization of the iommu_domain members into paging_domain_alloc(). Drop initialization of domain->owner as the callers all do it.
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Link: https://lore.kernel.org/r/4-v3-dbbe6f7e7ae3+124ffe-vtd_prep_jgg@nvidia.com
Signed-off-by: Lu Baolu <baolu.lu@linux.intel.com>
Link: https://lore.kernel.org/r/20250714045028.958850-7-baolu.lu@linux.intel.com
Signed-off-by: Will Deacon <will@kernel.org>
5c3687d578
iommu/vt-d: Do not wipe out the page table NID when devices detach
The NID is used to control which NUMA node the page table's memory is
allocated from. It should be a permanent property of the page table, set
when it is allocated, and should not change during attach/detach of devices.
Reviewed-by: Wei Wang <wei.w.wang@intel.com>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Link: https://lore.kernel.org/r/3-v3-dbbe6f7e7ae3+124ffe-vtd_prep_jgg@nvidia.com
Signed-off-by: Lu Baolu <baolu.lu@linux.intel.com>
Fixes:
00939bebe5
iommu/vt-d: Fold domain_exit() into intel_iommu_domain_free()
It has only one caller, no need for two functions. Correct the WARN_ON() error handling to leak the entire page table if the HW is still referencing it so we don't UAF during WARN_ON recovery.
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Link: https://lore.kernel.org/r/2-v3-dbbe6f7e7ae3+124ffe-vtd_prep_jgg@nvidia.com
Signed-off-by: Lu Baolu <baolu.lu@linux.intel.com>
Link: https://lore.kernel.org/r/20250714045028.958850-5-baolu.lu@linux.intel.com
Signed-off-by: Will Deacon <will@kernel.org>
cd0d0e4e48
iommu/vt-d: Lift the __pa to domain_setup_first_level/intel_svm_set_dev_pasid()
Pass the phys_addr_t down through the call chain from the top instead of passing a pgd_t * KVA. This moves the __pa() into domain_setup_first_level() which is the first function to obtain the pgd from the IOMMU page table in this call chain. The SVA flow is also adjusted to get the pa of the mm->pgd.
iommupt will move the __pa() into iommupt code, it never shares the KVA of the page table with the driver.
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Link: https://lore.kernel.org/r/1-v3-dbbe6f7e7ae3+124ffe-vtd_prep_jgg@nvidia.com
Signed-off-by: Lu Baolu <baolu.lu@linux.intel.com>
Link: https://lore.kernel.org/r/20250714045028.958850-4-baolu.lu@linux.intel.com
Signed-off-by: Will Deacon <will@kernel.org>
12724ce3fe
iommu/vt-d: Optimize iotlb_sync_map for non-caching/non-RWBF modes
The iotlb_sync_map iommu ops allows drivers to perform necessary cache
flushes when new mappings are established. For the Intel iommu driver,
this callback specifically serves two purposes:
- To flush caches when a second-stage page table is attached to a device
whose iommu is operating in caching mode (CAP_REG.CM==1).
- To explicitly flush internal write buffers to ensure updates to memory-
resident remapping structures are visible to hardware (CAP_REG.RWBF==1).
However, in scenarios where neither caching mode nor the RWBF flag is
active, the cache_tag_flush_range_np() helper, which is called in the
iotlb_sync_map path, effectively becomes a no-op.
Despite being a no-op, cache_tag_flush_range_np() involves iterating
through all cache tags of the iommu's attached to the domain, protected
by a spinlock. This unnecessary execution path introduces overhead,
leading to a measurable I/O performance regression. On systems with NVMes
under the same bridge, performance was observed to drop from approximately
~6150 MiB/s down to ~4985 MiB/s.
Introduce a flag in the dmar_domain structure. This flag will only be set
when iotlb_sync_map is required (i.e., when CM or RWBF is set). The
cache_tag_flush_range_np() is called only for domains where this flag is
set. This flag, once set, is immutable, given that there won't be mixed
configurations in real-world scenarios where some IOMMUs in a system
operate in caching mode while others do not. Theoretically, the
immutability of this flag does not impact functionality.
Reported-by: Ioanna Alifieraki <ioanna-maria.alifieraki@canonical.com>
Closes: https://bugs.launchpad.net/ubuntu/+source/linux/+bug/2115738
Link: https://lore.kernel.org/r/20250701171154.52435-1-ioanna-maria.alifieraki@canonical.com
Fixes:
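A sketch of the immutable flag described above (cap_caching_mode()/cap_rwbf() stand in for the CAP_REG bits named in the text; the struct and helper names are illustrative):

    #include <stdbool.h>
    #include <stdint.h>

    struct dmar_domain_sketch {
        bool iotlb_sync_map;    /* set once at attach if CM or RWBF */
    };

    static bool cap_caching_mode(uint64_t cap) { return (cap >> 7) & 1; }
    static bool cap_rwbf(uint64_t cap)         { return (cap >> 4) & 1; }

    /* At attach time: record whether this domain ever needs the flush. */
    static void note_iommu_caps(struct dmar_domain_sketch *d, uint64_t cap)
    {
        if (cap_caching_mode(cap) || cap_rwbf(cap))
            d->iotlb_sync_map = true;    /* immutable once set */
    }

    /* iotlb_sync_map path: skip the spinlock-protected cache-tag walk
     * entirely when the flag is clear. */
    static void iotlb_sync_map(struct dmar_domain_sketch *d)
    {
        if (!d->iotlb_sync_map)
            return;    /* fast path: nothing to flush */
        /* ... cache_tag_flush_range_np() equivalent runs here ... */
    }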
62622a8753
iommu: Allow an input type in hw_info op
The hw_info uAPI will support a bidirectional data_type field that can be used as an input field for user space to request a specific info data type. To prepare for the uAPI update, change the iommu layer first:
- Add a new IOMMU_HW_INFO_TYPE_DEFAULT as an input, for which driver can output its only (or firstly) supported type
- Update the kdoc accordingly
- Roll out the type validation in the existing drivers
Link: https://patch.msgid.link/r/00f4a2d3d930721f61367014717b3ba2d1e82a81.1752126748.git.nicolinc@nvidia.com
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Reviewed-by: Lu Baolu <baolu.lu@linux.intel.com>
Reviewed-by: Pranjal Shrivastava <praan@google.com>
Signed-off-by: Nicolin Chen <nicolinc@nvidia.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
4b57c057f9
iommu: Use enum iommu_hw_info_type for type in hw_info op
Replace u32 to make it clear. No functional changes. Also simplify the kdoc since the type itself is clear enough.
Link: https://patch.msgid.link/r/651c50dee8ab900f691202ef0204cd5a43fdd6a2.1752126748.git.nicolinc@nvidia.com
Reviewed-by: Pranjal Shrivastava <praan@google.com>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Signed-off-by: Nicolin Chen <nicolinc@nvidia.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
25b1b75bba
iommu/vt-d: Assign devtlb cache tag on ATS enablement
Commit <4f1492efb495> ("iommu/vt-d: Revert ATS timing change to fix boot
failure") placed the enabling of ATS in the probe_finalize callback. This
occurs after the default domain attachment, which is when the ATS cache
tag is assigned. Consequently, the device TLB cache tag is missed when the
domain is attached, leading to the device TLB not being invalidated in the
iommu_unmap paths.
Fix this by assigning the CACHE_TAG_DEVTLB cache tag when ATS is enabled.
Fixes:
8901812485
iommu: Remove ops.pgsize_bitmap from drivers that don't use it
These drivers all set the domain->pgsize_bitmap in their domain_alloc_paging() functions, so the ops value is never used. Delete it.
Reviewed-by: Sven Peter <sven@svenpeter.dev> # for Apple DART
Reviewed-by: Lu Baolu <baolu.lu@linux.intel.com>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Reviewed-by: Tomasz Jeznach <tjeznach@rivosinc.com> # for RISC-V
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Tested-by: Nicolin Chen <nicolinc@nvidia.com>
Link: https://lore.kernel.org/r/3-v2-68a2e1ba507c+1fb-iommu_rm_ops_pgsize_jgg@nvidia.com
Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
879b141b7c
Merge branches 'fixes', 'apple/dart', 'arm/smmu/updates', 'arm/smmu/bindings', 'fsl/pamu', 'mediatek', 'renesas/ipmmu', 's390', 'intel/vt-d', 'amd/amd-vi' and 'core' into next
320302baed
iommu/vt-d: Restore context entry setup order for aliased devices
Commit
f3fe7e1830
iommu/vt-d: Change dmar_ats_supported() to return boolean
According to "Function return values and names" in coding-style.rst, the dmar_ats_supported() function should return a boolean instead of an integer. Also, rename "ret" to "supported" to be more straightforward.
Signed-off-by: Wei Wang <wei.w.wang@intel.com>
Reviewed-by: Yi Liu <yi.l.liu@intel.com>
Link: https://lore.kernel.org/r/20250509140021.4029303-3-wei.w.wang@intel.com
Signed-off-by: Lu Baolu <baolu.lu@linux.intel.com>
Signed-off-by: Joerg Roedel <jroedel@suse.de>
3df2ebcee9
iommu/vt-d: Eliminate pci_physfn() in dmar_find_matched_satc_unit()
The function dmar_find_matched_satc_unit() contains a duplicate call to pci_physfn(). This call is unnecessary as pci_physfn() has already been invoked by the caller. Removing the redundant call simplifies the code and improves efficiency a bit.
Signed-off-by: Wei Wang <wei.w.wang@intel.com>
Link: https://lore.kernel.org/r/20250509140021.4029303-2-wei.w.wang@intel.com
Signed-off-by: Lu Baolu <baolu.lu@linux.intel.com>
Signed-off-by: Joerg Roedel <jroedel@suse.de>
868720fe15
iommu/vt-d: Replace spin_lock with mutex to protect domain ida
The domain ID allocator is currently protected by a spin_lock. However, ida_alloc_range can potentially block if it needs to allocate memory to grow its internal structures. Replace the spin_lock with a mutex, which is allowed to sleep. The memory allocation flags can then be updated from GFP_ATOMIC to GFP_KERNEL to allow blocking memory allocations if necessary.
Introduce a new mutex, did_lock, specifically for protecting the domain ida. The existing spinlock will remain for protecting other intel_iommu fields.
Signed-off-by: Lu Baolu <baolu.lu@linux.intel.com>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Link: https://lore.kernel.org/r/20250430021135.2370244-3-baolu.lu@linux.intel.com
Signed-off-by: Joerg Roedel <jroedel@suse.de>
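A user-space sketch of the locking change (the allocator body is a stand-in for ida_alloc_range and its potentially sleeping allocation):

    #include <pthread.h>
    #include <stdlib.h>

    static pthread_mutex_t did_lock = PTHREAD_MUTEX_INITIALIZER;

    static int ida_alloc_range_stub(int min, int max)
    {
        /* The real ida may allocate memory here and therefore block,
         * which is why a spinlock could not be held around it. */
        void *node = malloc(64);
        if (!node)
            return -1;
        free(node);
        (void)max;
        return min;    /* pretend the lowest free ID was found */
    }

    static int alloc_domain_id(int min, int max)
    {
        pthread_mutex_lock(&did_lock);    /* sleeping while held is fine */
        int id = ida_alloc_range_stub(min, max);
        pthread_mutex_unlock(&did_lock);
        return id;
    }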
f93b4ac592
iommu/vt-d: Use ida to manage domain id
Switch the intel iommu driver to use the ida mechanism for managing domain IDs, replacing the previous fixed-size bitmap.
The previous approach allocated a bitmap large enough to cover the maximum number of domain IDs supported by the hardware, regardless of the actual number of domains in use. This led to unnecessary memory consumption, especially on systems supporting a large number of iommu units but only utilizing a small number of domain IDs.
The ida allocator dynamically manages the allocation and freeing of integer IDs, only consuming memory for the IDs that are currently in use. This significantly optimizes memory usage compared to the fixed-size bitmap.
Signed-off-by: Lu Baolu <baolu.lu@linux.intel.com>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Link: https://lore.kernel.org/r/20250430021135.2370244-2-baolu.lu@linux.intel.com
Signed-off-by: Joerg Roedel <jroedel@suse.de>
6f5dc76580
iommu/vt-d: Restore WO permissions on second-level paging entries
VT-D HW can do WO permissions on the second-stage but not the first-stage
page table formats. The commit
f984fb09e6
iommu: Remove iommu_dev_enable/disable_feature()
No external drivers use these interfaces anymore. Furthermore, no existing iommu drivers implement anything in the callbacks. Remove them to avoid dead code.
Signed-off-by: Lu Baolu <baolu.lu@linux.intel.com>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Tested-by: Zhangfei Gao <zhangfei.gao@linaro.org>
Reviewed-by: Nicolin Chen <nicolinc@nvidia.com>
Link: https://lore.kernel.org/r/20250418080130.1844424-9-baolu.lu@linux.intel.com
Signed-off-by: Joerg Roedel <jroedel@suse.de>
17fce9d233
iommu/vt-d: Put iopf enablement in domain attach path
Update iopf enablement in the driver to use the new method, similar to the arm-smmu-v3 driver. Enable iopf support when any domain with an iopf_handler is attached, and disable it when the domain is removed. Place all the logic for controlling the PRI and iopf queue in the domain set/remove/replace paths.
Keep track of the number of domains set to the device and PASIDs that require iopf. When the first domain requiring iopf is attached, add the device to the iopf queue and enable PRI. When the last domain is removed, remove it from the iopf queue and disable PRI.
Signed-off-by: Lu Baolu <baolu.lu@linux.intel.com>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Reviewed-by: Yi Liu <yi.l.liu@intel.com>
Tested-by: Zhangfei Gao <zhangfei.gao@linaro.org>
Link: https://lore.kernel.org/r/20250418080130.1844424-4-baolu.lu@linux.intel.com
Signed-off-by: Joerg Roedel <jroedel@suse.de>
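A sketch of the reference counting described above (names are illustrative):

    struct dev_iopf_state {
        unsigned int iopf_refcount;    /* domains/PASIDs needing iopf */
    };

    static void pri_on(void)  { /* add device to iopf queue, enable PRI */ }
    static void pri_off(void) { /* remove from iopf queue, disable PRI */ }

    /* Domain attach path, when the domain has an iopf_handler. */
    static void iopf_get(struct dev_iopf_state *s)
    {
        if (s->iopf_refcount++ == 0)
            pri_on();    /* first user turns the machinery on */
    }

    /* Domain remove/replace path. */
    static void iopf_put(struct dev_iopf_state *s)
    {
        if (--s->iopf_refcount == 0)
            pri_off();    /* last user turns it off */
    }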
7c8896dd4a
iommu: Remove IOMMU_DEV_FEAT_SVA
None of the drivers implement anything here anymore, remove the dead code.
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Signed-off-by: Lu Baolu <baolu.lu@linux.intel.com>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Reviewed-by: Yi Liu <yi.l.liu@intel.com>
Tested-by: Zhangfei Gao <zhangfei.gao@linaro.org>
Link: https://lore.kernel.org/r/20250418080130.1844424-3-baolu.lu@linux.intel.com
Signed-off-by: Joerg Roedel <jroedel@suse.de>
2c8a7c66c9
iommu/vt-d: Apply quirk_iommu_igfx for 8086:0044 (QM57/QS57)
On the Lenovo ThinkPad X201, when Intel VT-d is enabled in the BIOS, the
kernel boots with errors related to DMAR, the graphical interface appeared
quite choppy, and the system resets erratically within a minute after it
booted:
DMAR: DRHD: handling fault status reg 3
DMAR: [DMA Write NO_PASID] Request device [00:02.0] fault addr 0xb97ff000
[fault reason 0x05] PTE Write access is not set
Upon comparing boot logs with VT-d on/off, I found that the Intel Calpella
quirk (`quirk_calpella_no_shadow_gtt()') correctly applied the igfx IOMMU
disable/quirk:
pci 0000:00:00.0: DMAR: BIOS has allocated no shadow GTT; disabling IOMMU
for graphics
Whereas with VT-d on, it went into the "else" branch, which then
triggered the DMAR handling fault above:
... else if (!disable_igfx_iommu) {
/* we have to ensure the gfx device is idle before we flush */
pci_info(dev, "Disabling batched IOTLB flush on Ironlake\n");
iommu_set_dma_strict();
}
Now, this is not exactly scientific, but moving 0x0044 to quirk_iommu_igfx
seems to have fixed the aforementioned issue. Running a few `git blame'
runs on the function, I have found that the quirk was originally
introduced as a fix specific to ThinkPad X201:
commit
4f1492efb4
4f1492efb4 |
iommu/vt-d: Revert ATS timing change to fix boot failure
Commit <5518f239aff1> ("iommu/vt-d: Move scalable mode ATS enablement to
probe path") changed the PCI ATS enablement logic to run earlier,
specifically before the default domain attachment.
On some client platforms, this change resulted in boot failures, causing
the kernel to panic with the following message and call trace:
Kernel panic - not syncing: DMAR hardware is malfunctioning
CPU: 0 UID: 0 PID: 1 Comm: swapper/0 Not tainted 6.14.0-rc3+ #175
Call Trace:
<TASK>
dump_stack_lvl+0x6f/0xb0
dump_stack+0x10/0x16
panic+0x10a/0x2b7
iommu_enable_translation.cold+0xc/0xc
intel_iommu_init+0xe39/0xec0
? trace_hardirqs_on+0x1e/0xd0
? __pfx_pci_iommu_init+0x10/0x10
pci_iommu_init+0xd/0x40
do_one_initcall+0x5b/0x390
kernel_init_freeable+0x26d/0x2b0
? __pfx_kernel_init+0x10/0x10
kernel_init+0x15/0x120
ret_from_fork+0x35/0x60
? __pfx_kernel_init+0x10/0x10
ret_from_fork_asm+0x1a/0x30
RIP: 1f0f:0x0
Code: Unable to access opcode bytes at 0xffffffffffffffd6.
RSP: 0000:0000000000000000 EFLAGS: 841f0f2e66 ORIG_RAX: 1f0f2e6600000000
RAX: 0000000000000000 RBX: 1f0f2e6600000000 RCX: 2e66000000000084
RDX: 0000000000841f0f RSI: 000000841f0f2e66 RDI: 00841f0f2e660000
RBP: 00841f0f2e660000 R08: 00841f0f2e660000 R09: 000000841f0f2e66
R10: 0000000000841f0f R11: 2e66000000000084 R12: 000000841f0f2e66
R13: 0000000000841f0f R14: 2e66000000000084 R15: 1f0f2e6600000000
</TASK>
---[ end Kernel panic - not syncing: DMAR hardware is malfunctioning ]---
Fix this by reverting the timing change for ATS enablement introduced by
the offending commit and restoring the previous behavior.
Fixes:
5087f663c2
iommu/pages: Remove iommu_alloc_page_node()
Use iommu_alloc_pages_node_sz() instead. AMD and Intel are both using 4K pages for these structures since those drivers only work on 4K PAGE_SIZE. riscv is also spec'd to use SZ_4K.
Reviewed-by: Lu Baolu <baolu.lu@linux.intel.com>
Tested-by: Alejandro Jimenez <alejandro.j.jimenez@oracle.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Link: https://lore.kernel.org/r/21-v4-c8663abbb606+3f7-iommu_pages_jgg@nvidia.com
Signed-off-by: Joerg Roedel <jroedel@suse.de>
868240c34e
iommu: Change iommu_iotlb_gather to use iommu_page_list
This converts the remaining places using list of pages to the new API. The Intel free path was shared with its gather path, so it is converted at the same time.
Reviewed-by: Lu Baolu <baolu.lu@linux.intel.com>
Tested-by: Nicolin Chen <nicolinc@nvidia.com>
Tested-by: Alejandro Jimenez <alejandro.j.jimenez@oracle.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Link: https://lore.kernel.org/r/11-v4-c8663abbb606+3f7-iommu_pages_jgg@nvidia.com
Signed-off-by: Joerg Roedel <jroedel@suse.de>
3e8e986ce8
iommu/pages: Remove iommu_free_page()
Use iommu_free_pages() instead.
Reviewed-by: Lu Baolu <baolu.lu@linux.intel.com>
Reviewed-by: Mostafa Saleh <smostafa@google.com>
Tested-by: Alejandro Jimenez <alejandro.j.jimenez@oracle.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Link: https://lore.kernel.org/r/6-v4-c8663abbb606+3f7-iommu_pages_jgg@nvidia.com
Signed-off-by: Joerg Roedel <jroedel@suse.de>
7d8c490ba3
iommu/vt-d: Remove an unnecessary call set_dma_ops()
Do not touch per-device DMA ops when the driver has been converted to use
the dma-iommu API.
Fixes:
48552153cf
iommufd 6.15 merge window pull
Two significant new items:
- Allow reporting IOMMU HW events to userspace when the events are clearly
linked to a device. This is linked to the VIOMMU object and is intended to
be used by a VMM to forward HW events to the virtual machine as part of
emulating a vIOMMU. ARM SMMUv3 is the first driver to use this
mechanism. Like the existing fault events the data is delivered through
a simple FD returning event records on read().
- PASID support in VFIO. "Process Address Space ID" is a PCI feature that
allows the device to tag all PCI DMA operations with an ID. The IOMMU
will then use the ID to select a unique translation for those DMAs. This
is part of Intel's vIOMMU support as VT-D HW requires the hypervisor to
manage each PASID entry. The support is generic so any VFIO user could
attach any translation to a PASID, and the support should work on ARM
SMMUv3 as well. AMD requires additional driver work.
Some minor updates, along with fixes:
- Prevent using nested parents with faults, no driver support today
- Put a single "cookie_type" value in the iommu_domain to indicate what
owns the various opaque owner fields
-----BEGIN PGP SIGNATURE-----
iHUEABYKAB0WIQRRRCHOFoQz/8F5bUaFwuHvBreFYQUCZ+q6NgAKCRCFwuHvBreF
YZ3zAQDbl4/Z0O+CLN2AXq4Zeiyq1HTSoF94hzqmm7lQ17zTIwD8CCdyLXHvupaq
tkBIv5IovpaxlrSk6M0kh2K8vPCk9Qk=
=CIM3
-----END PGP SIGNATURE-----
Merge tag 'for-linus-iommufd' of git://git.kernel.org/pub/scm/linux/kernel/git/jgg/iommufd
Pull iommufd updates from Jason Gunthorpe:
"Two significant new items:
- Allow reporting IOMMU HW events to userspace when the events are
clearly linked to a device.
This is linked to the VIOMMU object and is intended to be used by a
VMM to forward HW events to the virtual machine as part of
emulating a vIOMMU. ARM SMMUv3 is the first driver to use this
mechanism. Like the existing fault events the data is delivered
through a simple FD returning event records on read().
- PASID support in VFIO.
The "Process Address Space ID" is a PCI feature that allows the
device to tag all PCI DMA operations with an ID. The IOMMU will
then use the ID to select a unique translation for those DMAs. This
is part of Intel's vIOMMU support as VT-D HW requires the
hypervisor to manage each PASID entry.
The support is generic so any VFIO user could attach any
translation to a PASID, and the support should work on ARM SMMUv3
as well. AMD requires additional driver work.
Some minor updates, along with fixes:
- Prevent using nested parents with faults, no driver support today
- Put a single "cookie_type" value in the iommu_domain to indicate
what owns the various opaque owner fields"
* tag 'for-linus-iommufd' of git://git.kernel.org/pub/scm/linux/kernel/git/jgg/iommufd: (49 commits)
iommufd: Test attach before detaching pasid
iommufd: Fix iommu_vevent_header tables markup
iommu: Convert unreachable() to BUG()
iommufd: Balance veventq->num_events inc/dec
iommufd: Initialize the flags of vevent in iommufd_viommu_report_event()
iommufd/selftest: Add coverage for reporting max_pasid_log2 via IOMMU_HW_INFO
iommufd: Extend IOMMU_GET_HW_INFO to report PASID capability
vfio: VFIO_DEVICE_[AT|DE]TACH_IOMMUFD_PT support pasid
vfio-iommufd: Support pasid [at|de]tach for physical VFIO devices
ida: Add ida_find_first_range()
iommufd/selftest: Add coverage for iommufd pasid attach/detach
iommufd/selftest: Add test ops to test pasid attach/detach
iommufd/selftest: Add a helper to get test device
iommufd/selftest: Add set_dev_pasid in mock iommu
iommufd: Allow allocating PASID-compatible domain
iommu/vt-d: Add IOMMU_HWPT_ALLOC_PASID support
iommufd: Enforce PASID-compatible domain for RID
iommufd: Support pasid attach/replace
iommufd: Enforce PASID-compatible domain in PASID path
iommufd/device: Add pasid_attach array to track per-PASID attach
...
ce15c13e7a
iommu/vt-d: Add IOMMU_HWPT_ALLOC_PASID support
Intel iommu driver just treats it as a nop since Intel VT-d does not have special requirements on domains attached to either the PASID or RID of a PASID-capable device.
Link: https://patch.msgid.link/r/20250321171940.7213-14-yi.l.liu@intel.com
Reviewed-by: Lu Baolu <baolu.lu@linux.intel.com>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Reviewed-by: Kevin Tian <kevin.tian@intel.com>
Signed-off-by: Yi Liu <yi.l.liu@intel.com>
Tested-by: Nicolin Chen <nicolinc@nvidia.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
22df63a23a
Merge branches 'apple/dart', 'arm/smmu/updates', 'arm/smmu/bindings', 'rockchip', 's390', 'core', 'intel/vt-d' and 'amd/amd-vi' into next
93ae6e68b6
iommu/vt-d: Fix possible circular locking dependency
We have recently seen report of lockdep circular lock dependency warnings
on platforms like Skylake and Kabylake:
======================================================
WARNING: possible circular locking dependency detected
6.14.0-rc6-CI_DRM_16276-gca2c04fe76e8+ #1 Not tainted
------------------------------------------------------
swapper/0/1 is trying to acquire lock:
ffffffff8360ee48 (iommu_probe_device_lock){+.+.}-{3:3},
at: iommu_probe_device+0x1d/0x70
but task is already holding lock:
ffff888102c7efa8 (&device->physical_node_lock){+.+.}-{3:3},
at: intel_iommu_init+0xe75/0x11f0
which lock already depends on the new lock.
the existing dependency chain (in reverse order) is:
-> #6 (&device->physical_node_lock){+.+.}-{3:3}:
__mutex_lock+0xb4/0xe40
mutex_lock_nested+0x1b/0x30
intel_iommu_init+0xe75/0x11f0
pci_iommu_init+0x13/0x70
do_one_initcall+0x62/0x3f0
kernel_init_freeable+0x3da/0x6a0
kernel_init+0x1b/0x200
ret_from_fork+0x44/0x70
ret_from_fork_asm+0x1a/0x30
-> #5 (dmar_global_lock){++++}-{3:3}:
down_read+0x43/0x1d0
enable_drhd_fault_handling+0x21/0x110
cpuhp_invoke_callback+0x4c6/0x870
cpuhp_issue_call+0xbf/0x1f0
__cpuhp_setup_state_cpuslocked+0x111/0x320
__cpuhp_setup_state+0xb0/0x220
irq_remap_enable_fault_handling+0x3f/0xa0
apic_intr_mode_init+0x5c/0x110
x86_late_time_init+0x24/0x40
start_kernel+0x895/0xbd0
x86_64_start_reservations+0x18/0x30
x86_64_start_kernel+0xbf/0x110
common_startup_64+0x13e/0x141
-> #4 (cpuhp_state_mutex){+.+.}-{3:3}:
__mutex_lock+0xb4/0xe40
mutex_lock_nested+0x1b/0x30
__cpuhp_setup_state_cpuslocked+0x67/0x320
__cpuhp_setup_state+0xb0/0x220
page_alloc_init_cpuhp+0x2d/0x60
mm_core_init+0x18/0x2c0
start_kernel+0x576/0xbd0
x86_64_start_reservations+0x18/0x30
x86_64_start_kernel+0xbf/0x110
common_startup_64+0x13e/0x141
-> #3 (cpu_hotplug_lock){++++}-{0:0}:
__cpuhp_state_add_instance+0x4f/0x220
iova_domain_init_rcaches+0x214/0x280
iommu_setup_dma_ops+0x1a4/0x710
iommu_device_register+0x17d/0x260
intel_iommu_init+0xda4/0x11f0
pci_iommu_init+0x13/0x70
do_one_initcall+0x62/0x3f0
kernel_init_freeable+0x3da/0x6a0
kernel_init+0x1b/0x200
ret_from_fork+0x44/0x70
ret_from_fork_asm+0x1a/0x30
-> #2 (&domain->iova_cookie->mutex){+.+.}-{3:3}:
__mutex_lock+0xb4/0xe40
mutex_lock_nested+0x1b/0x30
iommu_setup_dma_ops+0x16b/0x710
iommu_device_register+0x17d/0x260
intel_iommu_init+0xda4/0x11f0
pci_iommu_init+0x13/0x70
do_one_initcall+0x62/0x3f0
kernel_init_freeable+0x3da/0x6a0
kernel_init+0x1b/0x200
ret_from_fork+0x44/0x70
ret_from_fork_asm+0x1a/0x30
-> #1 (&group->mutex){+.+.}-{3:3}:
__mutex_lock+0xb4/0xe40
mutex_lock_nested+0x1b/0x30
__iommu_probe_device+0x24c/0x4e0
probe_iommu_group+0x2b/0x50
bus_for_each_dev+0x7d/0xe0
iommu_device_register+0xe1/0x260
intel_iommu_init+0xda4/0x11f0
pci_iommu_init+0x13/0x70
do_one_initcall+0x62/0x3f0
kernel_init_freeable+0x3da/0x6a0
kernel_init+0x1b/0x200
ret_from_fork+0x44/0x70
ret_from_fork_asm+0x1a/0x30
-> #0 (iommu_probe_device_lock){+.+.}-{3:3}:
__lock_acquire+0x1637/0x2810
lock_acquire+0xc9/0x300
__mutex_lock+0xb4/0xe40
mutex_lock_nested+0x1b/0x30
iommu_probe_device+0x1d/0x70
intel_iommu_init+0xe90/0x11f0
pci_iommu_init+0x13/0x70
do_one_initcall+0x62/0x3f0
kernel_init_freeable+0x3da/0x6a0
kernel_init+0x1b/0x200
ret_from_fork+0x44/0x70
ret_from_fork_asm+0x1a/0x30
other info that might help us debug this:
Chain exists of:
iommu_probe_device_lock --> dmar_global_lock --> &device->physical_node_lock
Possible unsafe locking scenario:
       CPU0                            CPU1
       ----                            ----
  lock(&device->physical_node_lock);
                                       lock(dmar_global_lock);
                                       lock(&device->physical_node_lock);
  lock(iommu_probe_device_lock);

 *** DEADLOCK ***
This driver uses a global lock to protect the list of enumerated DMA
remapping units. It is necessary due to the driver's support for dynamic
addition and removal of remapping units at runtime.
Two distinct code paths require iteration over this remapping unit list:
- Device registration and probing: the driver iterates the list to
register each remapping unit with the upper layer IOMMU framework
and subsequently probe the devices managed by that unit.
- Global configuration: Upper layer components may also iterate the list
to apply configuration changes.
The lock acquisition order between these two code paths was reversed. This
caused lockdep warnings, indicating a risk of deadlock. Fix this warning
by releasing the global lock before invoking upper layer interfaces for
device registration.
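A sketch of the resulting ordering (pthread locks stand in for dmar_global_lock and the probe-side locks; names are illustrative):

    #include <pthread.h>

    static pthread_rwlock_t dmar_global_lock = PTHREAD_RWLOCK_INITIALIZER;

    static void probe_devices_behind(int unit)
    {
        (void)unit;    /* takes iommu_probe_device_lock internally */
    }

    static void register_units(const int *units, int n)
    {
        for (int i = 0; i < n; i++) {
            pthread_rwlock_rdlock(&dmar_global_lock);
            int unit = units[i];    /* snapshot under the list lock */
            pthread_rwlock_unlock(&dmar_global_lock);

            /* The upper-layer call now runs with the global lock
             * released, so the two paths can no longer acquire the
             * locks in opposite orders. */
            probe_devices_behind(unit);
        }
    }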
Fixes: