.. SPDX-License-Identifier: GPL-2.0

============================
PCI Peer-to-Peer DMA Support
============================

The PCI bus has pretty decent support for performing DMA transfers
between two devices on the bus. This type of transaction is henceforth
called Peer-to-Peer (or P2P). However, there are a number of issues that
make P2P transactions tricky to do in a perfectly safe way.

For PCIe, the routing of Transaction Layer Packets (TLPs) is well-defined
up until they reach a host bridge or root port. If the path includes PCIe
switches then, depending on the ACS settings, the transaction can be
routed entirely within the PCIe hierarchy and never reach the root port.
The kernel evaluates the PCIe topology and always permits P2P in these
well-defined cases.

However, if the P2P transaction reaches the host bridge then it might have to
hairpin back out the same root port, be routed inside the CPU SoC to
another PCIe root port, or be routed internally to the SoC.

The PCIe specification doesn't define the forwarding of transactions
between hierarchy domains, and the kernel defaults to blocking such
routing. An allow list exists to detect known-good hardware, in which
case P2P between any two PCIe devices will be permitted.

Since P2P inherently involves transactions between two devices, it
requires two drivers co-operating inside the kernel. The providing driver
has to convey its MMIO to the consuming driver. To meet the driver model
lifecycle rules, the MMIO must have all DMA mappings removed, all CPU
accesses prevented, and all page table mappings undone before the
providing driver completes remove().

This requires the providing and consuming drivers to actively work
together to guarantee that the consuming driver has stopped using the
MMIO during a removal cycle. This is done either by a synchronous
invalidation shutdown or by waiting for all usage refcounts to reach
zero.

At the lowest level the P2P subsystem offers a naked struct p2p_provider that
delegates lifecycle management to the providing driver. It is expected
that drivers using this option will wrap their MMIO memory in a DMABUF
and use DMABUF to provide an invalidation shutdown. These MMIO addresses
have no struct page, and if used with mmap() they must create special
PTEs. As such, very few kernel uAPIs can accept pointers to them; in
particular they cannot be used with read()/write(), including O_DIRECT.

Building on this, the subsystem offers a layer to wrap the MMIO in a
ZONE_DEVICE pgmap of MEMORY_DEVICE_PCI_P2PDMA to create struct pages.
The lifecycle of the pgmap ensures that when the pgmap is destroyed all
other drivers have stopped using the MMIO. This option works with
O_DIRECT flows, in some cases, if the underlying subsystem supports
handling MEMORY_DEVICE_PCI_P2PDMA through FOLL_PCI_P2PDMA. The use of
FOLL_LONGTERM is prevented. As this relies on pgmap, it also relies on
architecture support, along with alignment and minimum size limitations.

Driver Writer's Guide
=====================

In a given P2P implementation there may be three or more different
types of kernel drivers in play:

* Provider - A driver which provides or publishes P2P resources like
  memory or doorbell registers to other drivers.
* Client - A driver which makes use of a resource by setting up a
  DMA transaction to or from it.
* Orchestrator - A driver which orchestrates the flow of data between
  clients and providers.

In many cases there could be overlap between these three types (i.e.,
it may be typical for a driver to be both a provider and a client).

For example, in the NVMe Target Copy Offload implementation:

* The NVMe PCI driver is a client, provider, and orchestrator in that it
  exposes any CMB (Controller Memory Buffer) as a P2P memory resource
  (provider), it accepts P2P memory pages as buffers in requests to be
  used directly (client), and it can also make use of the CMB as
  submission queue entries (orchestrator).
* The RDMA driver is a client in this arrangement so that an RNIC
  can DMA directly to the memory exposed by the NVMe device.
* The NVMe Target driver (nvmet) can orchestrate the data from the RNIC
  to the P2P memory (CMB) and then to the NVMe device (and vice versa).

This is currently the only arrangement supported by the kernel but
one could imagine slight tweaks to this that would allow for the same
functionality. For example, if a specific RNIC added a BAR with some
memory behind it, its driver could add support as a P2P provider and
then the NVMe Target could use the RNIC's memory instead of the CMB
in cases where the NVMe cards in use do not have CMB support.

Provider Drivers
----------------

A provider simply needs to register a BAR (or a portion of a BAR)
as a P2P DMA resource using :c:func:`pci_p2pdma_add_resource()`.
This will register struct pages for all the specified memory.

After that it may optionally publish all of its resources as
P2P memory using :c:func:`pci_p2pmem_publish()`. This will allow
any orchestrator drivers to find and use the memory. When marked in
this way, the resource must be regular memory with no side effects.

For the time being this is fairly rudimentary in that all resources
are typically going to be P2P memory. Future work will likely expand
this to include other types of resources like doorbells.

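As a rough sketch, a provider exposing a memory region behind a BAR might
register and publish it as follows (the function name and the choice of
BAR 4 are hypothetical; a size of zero registers the whole BAR):

.. code-block:: c

   static int foo_setup_p2pmem(struct pci_dev *pdev)
   {
           int rc;

           /* Register all of BAR 4 as a P2P DMA resource */
           rc = pci_p2pdma_add_resource(pdev, 4, 0, 0);
           if (rc)
                   return rc;

           /* Allow orchestrators to find and use this memory */
           pci_p2pmem_publish(pdev, true);

           return 0;
   }
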
Client Drivers
--------------

A client driver only has to use the mapping API :c:func:`dma_map_sg()`
and :c:func:`dma_unmap_sg()` functions as usual, and the implementation
will do the right thing for the P2P capable memory.

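For illustration, a client mapping a scatter-gather list that may contain
P2P memory looks the same as any other DMA mapping (the function name here
is hypothetical):

.. code-block:: c

   static int foo_map_request(struct device *dev, struct scatterlist *sgl,
                              int nents, enum dma_data_direction dir)
   {
           int mapped;

           /* Works transparently for P2P-capable memory in the list */
           mapped = dma_map_sg(dev, sgl, nents, dir);
           if (!mapped)
                   return -ENOMEM;

           /* ... issue the DMA transaction ... */

           /* Unmap with the original nents, not the mapped count */
           dma_unmap_sg(dev, sgl, nents, dir);
           return 0;
   }
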
Orchestrator Drivers
--------------------

The first task an orchestrator driver must do is compile a list of
all client devices that will be involved in a given transaction. For
example, the NVMe Target driver creates a list including the namespace
block device and the RNIC in use. If the orchestrator has access to
a specific P2P provider to use, it may check compatibility using
:c:func:`pci_p2pdma_distance()`; otherwise it may find a memory provider
that's compatible with all clients using :c:func:`pci_p2pmem_find()`.
If more than one provider is supported, the one nearest to all the
clients will be chosen first. If more than one provider is an equal
distance away, one will be chosen at random (truly random, not merely
arbitrary). This function returns the PCI device to use for the provider
with a reference taken, so when it is no longer needed it should be
returned with pci_dev_put().

Once a provider is selected, the orchestrator can then use
:c:func:`pci_alloc_p2pmem()` and :c:func:`pci_free_p2pmem()` to
allocate P2P memory from the provider. :c:func:`pci_p2pmem_alloc_sgl()`
and :c:func:`pci_p2pmem_free_sgl()` are convenience functions for
allocating scatter-gather lists with P2P memory.

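A hedged sketch of this flow (error paths trimmed, and construction of the
client list elided; the foo_* names are hypothetical):

.. code-block:: c

   static int foo_setup_p2p(struct device **clients, int num_clients,
                            struct pci_dev **p2p_dev, void **buf,
                            size_t len)
   {
           /* Pick the provider nearest to all the clients */
           *p2p_dev = pci_p2pmem_find_many(clients, num_clients);
           if (!*p2p_dev)
                   return -ENODEV;

           /* Allocate P2P memory from the chosen provider */
           *buf = pci_alloc_p2pmem(*p2p_dev, len);
           if (!*buf) {
                   pci_dev_put(*p2p_dev); /* find took a reference */
                   return -ENOMEM;
           }
           return 0;
   }

   static void foo_teardown_p2p(struct pci_dev *p2p_dev, void *buf,
                                size_t len)
   {
           pci_free_p2pmem(p2p_dev, buf, len);
           pci_dev_put(p2p_dev);
   }
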
Struct Page Caveats
-------------------

While the MEMORY_DEVICE_PCI_P2PDMA pages can be installed in VMAs,
pin_user_pages() and related calls will not return them unless
FOLL_PCI_P2PDMA is set.

The MEMORY_DEVICE_PCI_P2PDMA pages require care to support in the
kernel. The KVA is still MMIO and must be accessed through the normal
readX()/writeX()/etc. helpers. Direct CPU access (e.g. memcpy()) is
forbidden, just like for any other MMIO mapping. While this will
actually work on some architectures, others will experience corruption
or just crash the kernel. Supporting FOLL_PCI_P2PDMA in a subsystem
requires scrubbing it to ensure no CPU access happens.

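To illustrate, a subsystem that has been scrubbed for MMIO safety might
opt in to receiving these pages like so (a sketch, not any particular
subsystem's code; the address and page count are arbitrary):

.. code-block:: c

   struct page *pages[16];
   int n;

   /* Without FOLL_PCI_P2PDMA, P2PDMA pages are simply not returned */
   n = pin_user_pages_fast(user_addr, 16,
                           FOLL_WRITE | FOLL_PCI_P2PDMA, pages);
   if (n > 0) {
           /* ... DMA map the pages; never touch them with the CPU ... */
           unpin_user_pages(pages, n);
   }
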
Usage With DMABUF
=================

DMABUF provides an alternative to the above struct page-based
client/provider/orchestrator system and should be used when struct page
doesn't exist. In this mode the exporting driver wraps some of its MMIO
in a DMABUF and gives the DMABUF FD to userspace.

Userspace can then pass the FD to an importing driver, which will ask
the exporting driver to map it for the importer.

In this case the initiator and target pci_devices are known, and the
P2P subsystem is used to determine the mapping type. The
phys_addr_t-based DMA API is used to establish the dma_addr_t.

Lifecycle is controlled by DMABUF move_notify(). When the exporting
driver wants to remove() it must deliver an invalidation shutdown to all
DMABUF importing drivers through move_notify() and synchronously DMA
unmap all the MMIO.

No importing driver can continue to have a DMA mapping to the MMIO after
the exporting driver has destroyed its p2p_provider.

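A minimal sketch of the exporter side of this contract (struct foo_dev
and the function name are hypothetical; dma_buf_move_notify() must be
called with the dma-buf reservation lock held):

.. code-block:: c

   static void foo_invalidate_mmio(struct foo_dev *fdev)
   {
           struct dma_buf *dmabuf = fdev->dmabuf;

           /* Tell all importers to stop using their mappings */
           dma_resv_lock(dmabuf->resv, NULL);
           dma_buf_move_notify(dmabuf);
           dma_resv_unlock(dmabuf->resv);

           /* ... synchronously DMA unmap all the MMIO and destroy
            * the p2p_provider ... */
   }
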
P2P DMA Support Library
=======================
.. kernel-doc:: drivers/pci/p2pdma.c
   :export: