[v9] vfio/pci: Allow MMIO regions to be exported through dma-buf

https://lore.kernel.org/all/20251120-dmabuf-vfio-v9-0-d7f71607f371@nvidia.com

Merge tag 'vfio-v6.19-dma-buf-v9+' into v6.19/vfio/next

[v9] vfio/pci: Allow MMIO regions to be exported through dma-buf

https://lore.kernel.org/all/20251120-dmabuf-vfio-v9-0-d7f71607f371@nvidia.com

Signed-off-by: Alex Williamson <alex@shazbot.org>
commit fa804aa4ac (Alex Williamson, 2025-11-20 21:20:00 -07:00)
23 changed files with 1101 additions and 141 deletions


@ -9,22 +9,48 @@ between two devices on the bus. This type of transaction is henceforth
called Peer-to-Peer (or P2P). However, there are a number of issues that
make P2P transactions tricky to do in a perfectly safe way.
One of the biggest issues is that PCI doesn't require forwarding
transactions between hierarchy domains, and in PCIe, each Root Port
defines a separate hierarchy domain. To make things worse, there is no
simple way to determine if a given Root Complex supports this or not.
(See PCIe r4.0, sec 1.3.1). Therefore, as of this writing, the kernel
only supports doing P2P when the endpoints involved are all behind the
same PCI bridge, as such devices are all in the same PCI hierarchy
domain, and the spec guarantees that all transactions within the
hierarchy will be routable, but it does not require routing
between hierarchies.
For PCIe the routing of Transaction Layer Packets (TLPs) is well-defined up
until they reach a host bridge or root port. If the path includes PCIe switches
then based on the ACS settings the transaction can route entirely within
the PCIe hierarchy and never reach the root port. The kernel will evaluate
the PCIe topology and always permit P2P in these well-defined cases.
The second issue is that to make use of existing interfaces in Linux,
memory that is used for P2P transactions needs to be backed by struct
pages. However, PCI BARs are not typically cache coherent so there are
a few corner case gotchas with these pages so developers need to
be careful about what they do with them.
However, if the P2P transaction reaches the host bridge then it might have to
hairpin back out the same root port, be routed inside the CPU SoC to another
PCIe root port, or be routed internally to the SoC.
The PCIe specification doesn't define the forwarding of transactions between
hierarchy domains, and the kernel defaults to blocking such routing. There is an
allowlist for detecting known-good HW, in which case P2P between any
two PCIe devices will be permitted.
Since P2P inherently involves transactions between two devices, it requires two
co-operating drivers inside the kernel. The providing driver has to convey
its MMIO to the consuming driver. To meet the driver model lifecycle rules, the
MMIO must have all DMA mappings removed, all CPU accesses prevented, and all page
table mappings undone before the providing driver completes remove().
This requires the providing and consuming driver to actively work together to
guarantee that the consuming driver has stopped using the MMIO during a removal
cycle. This is done by either a synchronous invalidation shutdown or waiting
for all usage refcounts to reach zero.
At the lowest level the P2P subsystem offers a naked struct p2pdma_provider that
delegates lifecycle management to the providing driver. It is expected that
drivers using this option will wrap their MMIO memory in DMABUF and use DMABUF
to provide an invalidation shutdown. These MMIO addresses have no struct page, and
if used with mmap() must create special PTEs. As such there are very few
kernel uAPIs that can accept pointers to them; in particular they cannot be used
with read()/write(), including O_DIRECT.
Building on this, the subsystem offers a layer to wrap the MMIO in a ZONE_DEVICE
pgmap of MEMORY_DEVICE_PCI_P2PDMA to create struct pages. The lifecycle of
pgmap ensures that when the pgmap is destroyed all other drivers have stopped
using the MMIO. This option works with O_DIRECT flows in some cases, if the
underlying subsystem supports handling MEMORY_DEVICE_PCI_P2PDMA through
FOLL_PCI_P2PDMA. The use of FOLL_LONGTERM is prevented. As this relies on pgmap
it also relies on architecture support along with alignment and minimum size
limitations.
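
For a consuming subsystem, opting in looks roughly like the sketch below; this
is an illustrative fragment, not text from the patch, and assumes a user VA
that may cover MEMORY_DEVICE_PCI_P2PDMA pages:

	/* Sketch: pin user pages while accepting P2PDMA pages. Note that
	 * FOLL_LONGTERM cannot be combined with FOLL_PCI_P2PDMA.
	 */
	struct page *pages[16];
	int n;

	n = pin_user_pages_fast(uaddr, 16, FOLL_WRITE | FOLL_PCI_P2PDMA, pages);
	if (n < 0)
		return n;
	/* ... DMA map with the p2pdma-aware state helpers, then ... */
	unpin_user_pages(pages, n);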
Driver Writer's Guide
@ -114,14 +140,39 @@ allocating scatter-gather lists with P2P memory.
Struct Page Caveats
-------------------
Driver writers should be very careful about not passing these special
struct pages to code that isn't prepared for it. At this time, the kernel
interfaces do not have any checks for ensuring this. This obviously
precludes passing these pages to userspace.
While the MEMORY_DEVICE_PCI_P2PDMA pages can be installed in VMAs,
pin_user_pages() and related will not return them unless FOLL_PCI_P2PDMA is set.
P2P memory is also technically IO memory but should never have any side
effects behind it. Thus, the order of loads and stores should not be important
and ioreadX(), iowriteX() and friends should not be necessary.
The MEMORY_DEVICE_PCI_P2PDMA pages require care to support in the kernel. The
KVA is still MMIO and must still be accessed through the normal
readX()/writeX()/etc helpers. Direct CPU access (e.g. memcpy) is forbidden, just
like any other MMIO mapping. While this will actually work on some
architectures, others will experience corruption or just crash in the kernel.
Supporting FOLL_PCI_P2PDMA in a subsystem requires scrubbing it to ensure no CPU
access happens.
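
To illustrate the access rule (a sketch, not part of the patch): the KVA of
such a page must be treated exactly like any other MMIO pointer:

	void __iomem *mmio = (void __iomem *)page_to_virt(p2p_page);
	u32 val;

	val = readl(mmio);		/* OK: MMIO accessor */
	memcpy_fromio(buf, mmio, len);	/* OK: bulk MMIO copy */
	/* memcpy(buf, page_to_virt(p2p_page), len) would be a forbidden
	 * direct CPU access and may corrupt data or crash on some
	 * architectures.
	 */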
Usage With DMABUF
=================
DMABUF provides an alternative to the above struct page-based
client/provider/orchestrator system and should be used when struct page
doesn't exist. In this mode the exporting driver will wrap
some of its MMIO in a DMABUF and give the DMABUF FD to userspace.
Userspace can then pass the FD to an importing driver which will ask the
exporting driver to map it to the importer.
In this case the initiator and target pci_devices are known and the P2P subsystem
is used to determine the mapping type. The phys_addr_t-based DMA API is used to
establish the dma_addr_t.
Lifecycle is controlled by DMABUF move_notify(). When the exporting driver wants
to remove() it must deliver an invalidation shutdown to all DMABUF importing
drivers through move_notify() and synchronously DMA unmap all the MMIO.
No importing driver can continue to have a DMA map to the MMIO after the
exporting driver has destroyed its p2pdma_provider.
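
An importing driver might wire this up as in the following hedged sketch
(function names hypothetical); the importer attaches with dynamic-importer ops
so the exporter can revoke through move_notify():

	static void my_move_notify(struct dma_buf_attachment *attach)
	{
		/* Called with the dma-buf reservation lock held: the exporter
		 * is revoking, so stop all DMA to the MMIO before returning.
		 */
		my_stop_dma(attach->importer_priv);
	}

	static const struct dma_buf_attach_ops my_attach_ops = {
		.allow_peer2peer = true,
		.move_notify = my_move_notify,
	};

	attach = dma_buf_dynamic_attach(dmabuf, dev, &my_attach_ops, my_priv);
	dma_resv_lock(dmabuf->resv, NULL);
	sgt = dma_buf_map_attachment(attach, DMA_BIDIRECTIONAL);
	dma_resv_unlock(dmabuf->resv);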
P2P DMA Support Library


@ -85,7 +85,7 @@ static inline bool blk_can_dma_map_iova(struct request *req,
static bool blk_dma_map_bus(struct blk_dma_iter *iter, struct phys_vec *vec)
{
iter->addr = pci_p2pdma_bus_addr_map(&iter->p2pdma, vec->paddr);
iter->addr = pci_p2pdma_bus_addr_map(iter->p2pdma.mem, vec->paddr);
iter->len = vec->len;
return true;
}


@ -1,6 +1,6 @@
# SPDX-License-Identifier: GPL-2.0-only
obj-y := dma-buf.o dma-fence.o dma-fence-array.o dma-fence-chain.o \
dma-fence-unwrap.o dma-resv.o
dma-fence-unwrap.o dma-resv.o dma-buf-mapping.o
obj-$(CONFIG_DMABUF_HEAPS) += dma-heap.o
obj-$(CONFIG_DMABUF_HEAPS) += heaps/
obj-$(CONFIG_SYNC_FILE) += sync_file.o


@ -0,0 +1,248 @@
// SPDX-License-Identifier: GPL-2.0-only
/*
* DMA BUF Mapping Helpers
*
*/
#include <linux/dma-buf-mapping.h>
#include <linux/dma-resv.h>
static struct scatterlist *fill_sg_entry(struct scatterlist *sgl, size_t length,
dma_addr_t addr)
{
unsigned int len, nents;
int i;
nents = DIV_ROUND_UP(length, UINT_MAX);
for (i = 0; i < nents; i++) {
len = min_t(size_t, length, UINT_MAX);
length -= len;
/*
* DMABUF abuses scatterlist to create a scatterlist
* that does not have any CPU list, only the DMA list.
* Always set the page related values to NULL to ensure
* importers can't use it. The phys_addr based DMA API
* does not require the CPU list for mapping or unmapping.
*/
sg_set_page(sgl, NULL, 0, 0);
sg_dma_address(sgl) = addr + i * UINT_MAX;
sg_dma_len(sgl) = len;
sgl = sg_next(sgl);
}
return sgl;
}
static unsigned int calc_sg_nents(struct dma_iova_state *state,
struct dma_buf_phys_vec *phys_vec,
size_t nr_ranges, size_t size)
{
unsigned int nents = 0;
size_t i;
if (!state || !dma_use_iova(state)) {
for (i = 0; i < nr_ranges; i++)
nents += DIV_ROUND_UP(phys_vec[i].len, UINT_MAX);
} else {
/*
* In the IOVA case there is only one mapping, spanning the
* whole IOVA range, but each SG entry's length is capped at
* UINT_MAX, so more than one entry may be needed.
*/
nents = DIV_ROUND_UP(size, UINT_MAX);
}
return nents;
}
/**
* struct dma_buf_dma - holds DMA mapping information
* @sgt: Scatter-gather table
* @state: DMA IOVA state relevant in IOMMU-based DMA
* @size: Total size of DMA transfer
*/
struct dma_buf_dma {
struct sg_table sgt;
struct dma_iova_state *state;
size_t size;
};
/**
* dma_buf_phys_vec_to_sgt - Returns the scatterlist table of the attachment
* from arrays of physical vectors. This function is intended for MMIO memory
* only.
* @attach: [in] attachment whose scatterlist is to be returned
* @provider: [in] p2pdma provider
* @phys_vec: [in] array of physical vectors
* @nr_ranges: [in] number of entries in phys_vec array
* @size: [in] total size of phys_vec
* @dir: [in] direction of DMA transfer
*
* Returns sg_table containing the scatterlist to be returned; returns ERR_PTR
* on error. May return -EINTR if it is interrupted by a signal.
*
* On success, the DMA addresses and lengths in the returned scatterlist are
* PAGE_SIZE aligned.
*
* A mapping must be unmapped by using dma_buf_free_sgt().
*
* NOTE: This function is intended for exporters. If direct traffic routing is
* mandatory, the exporter should call pci_p2pdma_map_type() before calling
* this function.
*/
struct sg_table *dma_buf_phys_vec_to_sgt(struct dma_buf_attachment *attach,
struct p2pdma_provider *provider,
struct dma_buf_phys_vec *phys_vec,
size_t nr_ranges, size_t size,
enum dma_data_direction dir)
{
unsigned int nents, mapped_len = 0;
struct dma_buf_dma *dma;
struct scatterlist *sgl;
dma_addr_t addr;
size_t i;
int ret;
dma_resv_assert_held(attach->dmabuf->resv);
if (WARN_ON(!attach || !attach->dmabuf || !provider))
/* This function is supposed to work on MMIO memory only */
return ERR_PTR(-EINVAL);
dma = kzalloc(sizeof(*dma), GFP_KERNEL);
if (!dma)
return ERR_PTR(-ENOMEM);
switch (pci_p2pdma_map_type(provider, attach->dev)) {
case PCI_P2PDMA_MAP_BUS_ADDR:
/*
* There is no need for an IOVA at all in this flow.
*/
break;
case PCI_P2PDMA_MAP_THRU_HOST_BRIDGE:
dma->state = kzalloc(sizeof(*dma->state), GFP_KERNEL);
if (!dma->state) {
ret = -ENOMEM;
goto err_free_dma;
}
dma_iova_try_alloc(attach->dev, dma->state, 0, size);
break;
default:
ret = -EINVAL;
goto err_free_dma;
}
nents = calc_sg_nents(dma->state, phys_vec, nr_ranges, size);
ret = sg_alloc_table(&dma->sgt, nents, GFP_KERNEL | __GFP_ZERO);
if (ret)
goto err_free_state;
sgl = dma->sgt.sgl;
for (i = 0; i < nr_ranges; i++) {
if (!dma->state) {
addr = pci_p2pdma_bus_addr_map(provider,
phys_vec[i].paddr);
} else if (dma_use_iova(dma->state)) {
ret = dma_iova_link(attach->dev, dma->state,
phys_vec[i].paddr, 0,
phys_vec[i].len, dir,
DMA_ATTR_MMIO);
if (ret)
goto err_unmap_dma;
mapped_len += phys_vec[i].len;
} else {
addr = dma_map_phys(attach->dev, phys_vec[i].paddr,
phys_vec[i].len, dir,
DMA_ATTR_MMIO);
ret = dma_mapping_error(attach->dev, addr);
if (ret)
goto err_unmap_dma;
}
if (!dma->state || !dma_use_iova(dma->state))
sgl = fill_sg_entry(sgl, phys_vec[i].len, addr);
}
if (dma->state && dma_use_iova(dma->state)) {
WARN_ON_ONCE(mapped_len != size);
ret = dma_iova_sync(attach->dev, dma->state, 0, mapped_len);
if (ret)
goto err_unmap_dma;
sgl = fill_sg_entry(sgl, mapped_len, dma->state->addr);
}
dma->size = size;
/*
* No CPU list is included; set orig_nents = 0 so importers can detect
* this via the SG table (use nents only).
*/
dma->sgt.orig_nents = 0;
/*
* sgl must be NULL here, indicating the previous entry was the last
* one and that we allocated the correct number of entries in sg_alloc_table()
*/
WARN_ON_ONCE(sgl);
return &dma->sgt;
err_unmap_dma:
if (!i || !dma->state) {
; /* Do nothing */
} else if (dma_use_iova(dma->state)) {
dma_iova_destroy(attach->dev, dma->state, mapped_len, dir,
DMA_ATTR_MMIO);
} else {
for_each_sgtable_dma_sg(&dma->sgt, sgl, i)
dma_unmap_phys(attach->dev, sg_dma_address(sgl),
sg_dma_len(sgl), dir, DMA_ATTR_MMIO);
}
sg_free_table(&dma->sgt);
err_free_state:
kfree(dma->state);
err_free_dma:
kfree(dma);
return ERR_PTR(ret);
}
EXPORT_SYMBOL_NS_GPL(dma_buf_phys_vec_to_sgt, "DMA_BUF");
/**
* dma_buf_free_sgt - unmaps the buffer
* @attach: [in] attachment to unmap buffer from
* @sgt: [in] scatterlist info of the buffer to unmap
* @dir: [in] direction of DMA transfer
*
* This unmaps the DMA mapping for @attach obtained
* by dma_buf_phys_vec_to_sgt().
*/
void dma_buf_free_sgt(struct dma_buf_attachment *attach, struct sg_table *sgt,
enum dma_data_direction dir)
{
struct dma_buf_dma *dma = container_of(sgt, struct dma_buf_dma, sgt);
int i;
dma_resv_assert_held(attach->dmabuf->resv);
if (!dma->state) {
; /* Do nothing */
} else if (dma_use_iova(dma->state)) {
dma_iova_destroy(attach->dev, dma->state, dma->size, dir,
DMA_ATTR_MMIO);
} else {
struct scatterlist *sgl;
for_each_sgtable_dma_sg(sgt, sgl, i)
dma_unmap_phys(attach->dev, sg_dma_address(sgl),
sg_dma_len(sgl), dir, DMA_ATTR_MMIO);
}
sg_free_table(sgt);
kfree(dma->state);
kfree(dma);
}
EXPORT_SYMBOL_NS_GPL(dma_buf_free_sgt, "DMA_BUF");


@ -1439,8 +1439,8 @@ int iommu_dma_map_sg(struct device *dev, struct scatterlist *sg, int nents,
* as a bus address, __finalise_sg() will copy the dma
* address into the output segment.
*/
s->dma_address = pci_p2pdma_bus_addr_map(&p2pdma_state,
sg_phys(s));
s->dma_address = pci_p2pdma_bus_addr_map(
p2pdma_state.mem, sg_phys(s));
sg_dma_len(s) = sg->length;
sg_dma_mark_bus_address(s);
continue;


@ -25,12 +25,12 @@ struct pci_p2pdma {
struct gen_pool *pool;
bool p2pmem_published;
struct xarray map_types;
struct p2pdma_provider mem[PCI_STD_NUM_BARS];
};
struct pci_p2pdma_pagemap {
struct pci_dev *provider;
u64 bus_offset;
struct dev_pagemap pgmap;
struct p2pdma_provider *mem;
};
static struct pci_p2pdma_pagemap *to_p2p_pgmap(struct dev_pagemap *pgmap)
@ -204,8 +204,8 @@ static void p2pdma_page_free(struct page *page)
{
struct pci_p2pdma_pagemap *pgmap = to_p2p_pgmap(page_pgmap(page));
/* safe to dereference while a reference is held to the percpu ref */
struct pci_p2pdma *p2pdma =
rcu_dereference_protected(pgmap->provider->p2pdma, 1);
struct pci_p2pdma *p2pdma = rcu_dereference_protected(
to_pci_dev(pgmap->mem->owner)->p2pdma, 1);
struct percpu_ref *ref;
gen_pool_free_owner(p2pdma->pool, (uintptr_t)page_to_virt(page),
@ -228,56 +228,136 @@ static void pci_p2pdma_release(void *data)
/* Flush and disable pci_alloc_p2p_mem() */
pdev->p2pdma = NULL;
synchronize_rcu();
if (p2pdma->pool)
synchronize_rcu();
xa_destroy(&p2pdma->map_types);
if (!p2pdma->pool)
return;
gen_pool_destroy(p2pdma->pool);
sysfs_remove_group(&pdev->dev.kobj, &p2pmem_group);
xa_destroy(&p2pdma->map_types);
}
static int pci_p2pdma_setup(struct pci_dev *pdev)
/**
* pcim_p2pdma_init - Initialise peer-to-peer DMA providers
* @pdev: The PCI device to enable P2PDMA for
*
* This function initializes the peer-to-peer DMA infrastructure
* for a PCI device. It allocates and sets up the necessary data
* structures to support P2PDMA operations, including mapping type
* tracking.
*/
int pcim_p2pdma_init(struct pci_dev *pdev)
{
int error = -ENOMEM;
struct pci_p2pdma *p2p;
int i, ret;
p2p = rcu_dereference_protected(pdev->p2pdma, 1);
if (p2p)
return 0;
p2p = devm_kzalloc(&pdev->dev, sizeof(*p2p), GFP_KERNEL);
if (!p2p)
return -ENOMEM;
xa_init(&p2p->map_types);
/*
* Iterate over all standard PCI BARs and record only those that
* correspond to MMIO regions. Skip non-memory resources (e.g. I/O
* port BARs) since they cannot be used for peer-to-peer (P2P)
* transactions.
*/
for (i = 0; i < PCI_STD_NUM_BARS; i++) {
if (!(pci_resource_flags(pdev, i) & IORESOURCE_MEM))
continue;
p2p->pool = gen_pool_create(PAGE_SHIFT, dev_to_node(&pdev->dev));
if (!p2p->pool)
goto out;
p2p->mem[i].owner = &pdev->dev;
p2p->mem[i].bus_offset =
pci_bus_address(pdev, i) - pci_resource_start(pdev, i);
}
error = devm_add_action_or_reset(&pdev->dev, pci_p2pdma_release, pdev);
if (error)
goto out_pool_destroy;
error = sysfs_create_group(&pdev->dev.kobj, &p2pmem_group);
if (error)
goto out_pool_destroy;
ret = devm_add_action_or_reset(&pdev->dev, pci_p2pdma_release, pdev);
if (ret)
goto out_p2p;
rcu_assign_pointer(pdev->p2pdma, p2p);
return 0;
out_pool_destroy:
gen_pool_destroy(p2p->pool);
out:
out_p2p:
devm_kfree(&pdev->dev, p2p);
return error;
return ret;
}
EXPORT_SYMBOL_GPL(pcim_p2pdma_init);
/**
* pcim_p2pdma_provider - Get peer-to-peer DMA provider
* @pdev: The PCI device to enable P2PDMA for
* @bar: BAR index to get provider
*
* This function gets peer-to-peer DMA provider for a PCI device. The lifetime
* of the provider (and of course the MMIO) is bound to the lifetime of the
* driver. A driver calling this function must ensure that all references to the
* provider, and any DMA mappings created for any MMIO, are all cleaned up
* before the driver remove() completes.
*
* Since P2P is almost always shared with a second driver, some mechanism to
* notify, invalidate and revoke the MMIO's DMA must be in place before using
* this function. For example, a revoke can be built using DMABUF.
*/
struct p2pdma_provider *pcim_p2pdma_provider(struct pci_dev *pdev, int bar)
{
struct pci_p2pdma *p2p;
if (!(pci_resource_flags(pdev, bar) & IORESOURCE_MEM))
return NULL;
p2p = rcu_dereference_protected(pdev->p2pdma, 1);
if (WARN_ON(!p2p))
/* Someone forgot to call pcim_p2pdma_init() first */
return NULL;
return &p2p->mem[bar];
}
EXPORT_SYMBOL_GPL(pcim_p2pdma_provider);
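For reference, a providing driver would typically pair the two calls at probe
time; a minimal sketch with error handling elided (the BAR index is an
assumption):

	/* in the provider's probe() */
	ret = pcim_p2pdma_init(pdev);
	if (ret)
		return ret;

	provider = pcim_p2pdma_provider(pdev, 0);	/* BAR 0 MMIO */
	if (!provider)
		return -EINVAL;
	/* hand 'provider' to e.g. dma_buf_phys_vec_to_sgt() at map time */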
static int pci_p2pdma_setup_pool(struct pci_dev *pdev)
{
struct pci_p2pdma *p2pdma;
int ret;
p2pdma = rcu_dereference_protected(pdev->p2pdma, 1);
if (p2pdma->pool)
/* We already set up the pool, do nothing. */
return 0;
p2pdma->pool = gen_pool_create(PAGE_SHIFT, dev_to_node(&pdev->dev));
if (!p2pdma->pool)
return -ENOMEM;
ret = sysfs_create_group(&pdev->dev.kobj, &p2pmem_group);
if (ret)
goto out_pool_destroy;
return 0;
out_pool_destroy:
gen_pool_destroy(p2pdma->pool);
p2pdma->pool = NULL;
return ret;
}
static void pci_p2pdma_unmap_mappings(void *data)
{
struct pci_dev *pdev = data;
struct pci_p2pdma_pagemap *p2p_pgmap = data;
/*
* Removing the alloc attribute from sysfs will call
* unmap_mapping_range() on the inode, teardown any existing userspace
* mappings and prevent new ones from being created.
*/
sysfs_remove_file_from_group(&pdev->dev.kobj, &p2pmem_alloc_attr.attr,
sysfs_remove_file_from_group(&p2p_pgmap->mem->owner->kobj,
&p2pmem_alloc_attr.attr,
p2pmem_group.name);
}
@ -295,6 +375,7 @@ int pci_p2pdma_add_resource(struct pci_dev *pdev, int bar, size_t size,
u64 offset)
{
struct pci_p2pdma_pagemap *p2p_pgmap;
struct p2pdma_provider *mem;
struct dev_pagemap *pgmap;
struct pci_p2pdma *p2pdma;
void *addr;
@ -312,11 +393,21 @@ int pci_p2pdma_add_resource(struct pci_dev *pdev, int bar, size_t size,
if (size + offset > pci_resource_len(pdev, bar))
return -EINVAL;
if (!pdev->p2pdma) {
error = pci_p2pdma_setup(pdev);
if (error)
return error;
}
error = pcim_p2pdma_init(pdev);
if (error)
return error;
error = pci_p2pdma_setup_pool(pdev);
if (error)
return error;
mem = pcim_p2pdma_provider(pdev, bar);
/*
* We checked the validity of the BAR prior to calling
* pcim_p2pdma_provider(), so it should never return NULL.
*/
if (WARN_ON(!mem))
return -EINVAL;
p2p_pgmap = devm_kzalloc(&pdev->dev, sizeof(*p2p_pgmap), GFP_KERNEL);
if (!p2p_pgmap)
@ -328,10 +419,7 @@ int pci_p2pdma_add_resource(struct pci_dev *pdev, int bar, size_t size,
pgmap->nr_range = 1;
pgmap->type = MEMORY_DEVICE_PCI_P2PDMA;
pgmap->ops = &p2pdma_pgmap_ops;
p2p_pgmap->provider = pdev;
p2p_pgmap->bus_offset = pci_bus_address(pdev, bar) -
pci_resource_start(pdev, bar);
p2p_pgmap->mem = mem;
addr = devm_memremap_pages(&pdev->dev, pgmap);
if (IS_ERR(addr)) {
@ -340,7 +428,7 @@ int pci_p2pdma_add_resource(struct pci_dev *pdev, int bar, size_t size,
}
error = devm_add_action_or_reset(&pdev->dev, pci_p2pdma_unmap_mappings,
pdev);
p2p_pgmap);
if (error)
goto pages_free;
@ -972,16 +1060,26 @@ void pci_p2pmem_publish(struct pci_dev *pdev, bool publish)
}
EXPORT_SYMBOL_GPL(pci_p2pmem_publish);
static enum pci_p2pdma_map_type pci_p2pdma_map_type(struct dev_pagemap *pgmap,
struct device *dev)
/**
* pci_p2pdma_map_type - Determine the mapping type for P2PDMA transfers
* @provider: P2PDMA provider structure
* @dev: Target device for the transfer
*
* Determines how peer-to-peer DMA transfers should be mapped between
* the provider and the target device. The mapping type indicates whether
* the transfer can be done directly through PCI switches or must go
* through the host bridge.
*/
enum pci_p2pdma_map_type pci_p2pdma_map_type(struct p2pdma_provider *provider,
struct device *dev)
{
enum pci_p2pdma_map_type type = PCI_P2PDMA_MAP_NOT_SUPPORTED;
struct pci_dev *provider = to_p2p_pgmap(pgmap)->provider;
struct pci_dev *pdev = to_pci_dev(provider->owner);
struct pci_dev *client;
struct pci_p2pdma *p2pdma;
int dist;
if (!provider->p2pdma)
if (!pdev->p2pdma)
return PCI_P2PDMA_MAP_NOT_SUPPORTED;
if (!dev_is_pci(dev))
@ -990,7 +1088,7 @@ static enum pci_p2pdma_map_type pci_p2pdma_map_type(struct dev_pagemap *pgmap,
client = to_pci_dev(dev);
rcu_read_lock();
p2pdma = rcu_dereference(provider->p2pdma);
p2pdma = rcu_dereference(pdev->p2pdma);
if (p2pdma)
type = xa_to_value(xa_load(&p2pdma->map_types,
@ -998,7 +1096,7 @@ static enum pci_p2pdma_map_type pci_p2pdma_map_type(struct dev_pagemap *pgmap,
rcu_read_unlock();
if (type == PCI_P2PDMA_MAP_UNKNOWN)
return calc_map_type_and_dist(provider, client, &dist, true);
return calc_map_type_and_dist(pdev, client, &dist, true);
return type;
}
@ -1006,9 +1104,13 @@ static enum pci_p2pdma_map_type pci_p2pdma_map_type(struct dev_pagemap *pgmap,
void __pci_p2pdma_update_state(struct pci_p2pdma_map_state *state,
struct device *dev, struct page *page)
{
state->pgmap = page_pgmap(page);
state->map = pci_p2pdma_map_type(state->pgmap, dev);
state->bus_off = to_p2p_pgmap(state->pgmap)->bus_offset;
struct pci_p2pdma_pagemap *p2p_pgmap = to_p2p_pgmap(page_pgmap(page));
if (state->mem == p2p_pgmap->mem)
return;
state->mem = p2p_pgmap->mem;
state->map = pci_p2pdma_map_type(p2p_pgmap->mem, dev);
}
/**


@ -55,6 +55,9 @@ config VFIO_PCI_ZDEV_KVM
To enable s390x KVM vfio-pci extensions, say Y.
config VFIO_PCI_DMABUF
def_bool y if VFIO_PCI_CORE && PCI_P2PDMA && DMA_SHARED_BUFFER
source "drivers/vfio/pci/mlx5/Kconfig"
source "drivers/vfio/pci/hisilicon/Kconfig"


@ -2,6 +2,7 @@
vfio-pci-core-y := vfio_pci_core.o vfio_pci_intrs.o vfio_pci_rdwr.o vfio_pci_config.o
vfio-pci-core-$(CONFIG_VFIO_PCI_ZDEV_KVM) += vfio_pci_zdev.o
vfio-pci-core-$(CONFIG_VFIO_PCI_DMABUF) += vfio_pci_dmabuf.o
obj-$(CONFIG_VFIO_PCI_CORE) += vfio-pci-core.o
vfio-pci-y := vfio_pci.o


@ -7,6 +7,7 @@
#include <linux/vfio_pci_core.h>
#include <linux/delay.h>
#include <linux/jiffies.h>
#include <linux/pci-p2pdma.h>
/*
* The device memory usable to the workloads running in the VM is cached
@ -652,6 +653,50 @@ nvgrace_gpu_write(struct vfio_device *core_vdev,
return vfio_pci_core_write(core_vdev, buf, count, ppos);
}
static int nvgrace_get_dmabuf_phys(struct vfio_pci_core_device *core_vdev,
struct p2pdma_provider **provider,
unsigned int region_index,
struct dma_buf_phys_vec *phys_vec,
struct vfio_region_dma_range *dma_ranges,
size_t nr_ranges)
{
struct nvgrace_gpu_pci_core_device *nvdev = container_of(
core_vdev, struct nvgrace_gpu_pci_core_device, core_device);
struct pci_dev *pdev = core_vdev->pdev;
struct mem_region *mem_region;
/*
* if (nvdev->resmem.memlength && region_index == RESMEM_REGION_INDEX) {
* The P2P properties of the non-BAR memory are the same as the
* BAR memory, so just use the provider for index 0. Someday
* when CXL gets P2P support we could create CXLish providers
* for the non-BAR memory.
* } else if (region_index == USEMEM_REGION_INDEX) {
* This is actually cacheable memory and isn't treated as P2P in
* the chip. For now we have no way to push cacheable memory
* through everything and the Grace HW doesn't care what caching
* attribute is programmed into the SMMU. So use BAR 0.
* }
*/
mem_region = nvgrace_gpu_memregion(region_index, nvdev);
if (mem_region) {
*provider = pcim_p2pdma_provider(pdev, 0);
if (!*provider)
return -EINVAL;
return vfio_pci_core_fill_phys_vec(phys_vec, dma_ranges,
nr_ranges,
mem_region->memphys,
mem_region->memlength);
}
return vfio_pci_core_get_dmabuf_phys(core_vdev, provider, region_index,
phys_vec, dma_ranges, nr_ranges);
}
static const struct vfio_pci_device_ops nvgrace_gpu_pci_dev_ops = {
.get_dmabuf_phys = nvgrace_get_dmabuf_phys,
};
static const struct vfio_device_ops nvgrace_gpu_pci_ops = {
.name = "nvgrace-gpu-vfio-pci",
.init = vfio_pci_core_init_dev,
@ -673,6 +718,10 @@ static const struct vfio_device_ops nvgrace_gpu_pci_ops = {
.detach_ioas = vfio_iommufd_physical_detach_ioas,
};
static const struct vfio_pci_device_ops nvgrace_gpu_pci_dev_core_ops = {
.get_dmabuf_phys = vfio_pci_core_get_dmabuf_phys,
};
static const struct vfio_device_ops nvgrace_gpu_pci_core_ops = {
.name = "nvgrace-gpu-vfio-pci-core",
.init = vfio_pci_core_init_dev,
@ -936,6 +985,9 @@ static int nvgrace_gpu_probe(struct pci_dev *pdev,
memphys, memlength);
if (ret)
goto out_put_vdev;
nvdev->core_device.pci_ops = &nvgrace_gpu_pci_dev_ops;
} else {
nvdev->core_device.pci_ops = &nvgrace_gpu_pci_dev_core_ops;
}
ret = vfio_pci_core_register_device(&nvdev->core_device);


@ -148,6 +148,10 @@ static const struct vfio_device_ops vfio_pci_ops = {
.pasid_detach_ioas = vfio_iommufd_physical_pasid_detach_ioas,
};
static const struct vfio_pci_device_ops vfio_pci_dev_ops = {
.get_dmabuf_phys = vfio_pci_core_get_dmabuf_phys,
};
static int vfio_pci_probe(struct pci_dev *pdev, const struct pci_device_id *id)
{
struct vfio_pci_core_device *vdev;
@ -162,6 +166,7 @@ static int vfio_pci_probe(struct pci_dev *pdev, const struct pci_device_id *id)
return PTR_ERR(vdev);
dev_set_drvdata(&pdev->dev, vdev);
vdev->pci_ops = &vfio_pci_dev_ops;
ret = vfio_pci_core_register_device(vdev);
if (ret)
goto out_put_vdev;


@ -589,10 +589,12 @@ static int vfio_basic_config_write(struct vfio_pci_core_device *vdev, int pos,
virt_mem = !!(le16_to_cpu(*virt_cmd) & PCI_COMMAND_MEMORY);
new_mem = !!(new_cmd & PCI_COMMAND_MEMORY);
if (!new_mem)
if (!new_mem) {
vfio_pci_zap_and_down_write_memory_lock(vdev);
else
vfio_pci_dma_buf_move(vdev, true);
} else {
down_write(&vdev->memory_lock);
}
/*
* If the user is writing mem/io enable (new_mem/io) and we
@ -627,6 +629,8 @@ static int vfio_basic_config_write(struct vfio_pci_core_device *vdev, int pos,
*virt_cmd &= cpu_to_le16(~mask);
*virt_cmd |= cpu_to_le16(new_cmd & mask);
if (__vfio_pci_memory_enabled(vdev))
vfio_pci_dma_buf_move(vdev, false);
up_write(&vdev->memory_lock);
}
@ -707,12 +711,16 @@ static int __init init_pci_cap_basic_perm(struct perm_bits *perm)
static void vfio_lock_and_set_power_state(struct vfio_pci_core_device *vdev,
pci_power_t state)
{
if (state >= PCI_D3hot)
if (state >= PCI_D3hot) {
vfio_pci_zap_and_down_write_memory_lock(vdev);
else
vfio_pci_dma_buf_move(vdev, true);
} else {
down_write(&vdev->memory_lock);
}
vfio_pci_set_power_state(vdev, state);
if (__vfio_pci_memory_enabled(vdev))
vfio_pci_dma_buf_move(vdev, false);
up_write(&vdev->memory_lock);
}
@ -900,7 +908,10 @@ static int vfio_exp_config_write(struct vfio_pci_core_device *vdev, int pos,
if (!ret && (cap & PCI_EXP_DEVCAP_FLR)) {
vfio_pci_zap_and_down_write_memory_lock(vdev);
vfio_pci_dma_buf_move(vdev, true);
pci_try_reset_function(vdev->pdev);
if (__vfio_pci_memory_enabled(vdev))
vfio_pci_dma_buf_move(vdev, false);
up_write(&vdev->memory_lock);
}
}
@ -982,7 +993,10 @@ static int vfio_af_config_write(struct vfio_pci_core_device *vdev, int pos,
if (!ret && (cap & PCI_AF_CAP_FLR) && (cap & PCI_AF_CAP_TP)) {
vfio_pci_zap_and_down_write_memory_lock(vdev);
vfio_pci_dma_buf_move(vdev, true);
pci_try_reset_function(vdev->pdev);
if (__vfio_pci_memory_enabled(vdev))
vfio_pci_dma_buf_move(vdev, false);
up_write(&vdev->memory_lock);
}
}


@ -28,6 +28,7 @@
#include <linux/nospec.h>
#include <linux/sched/mm.h>
#include <linux/iommufd.h>
#include <linux/pci-p2pdma.h>
#if IS_ENABLED(CONFIG_EEH)
#include <asm/eeh.h>
#endif
@ -286,6 +287,8 @@ static int vfio_pci_runtime_pm_entry(struct vfio_pci_core_device *vdev,
* semaphore.
*/
vfio_pci_zap_and_down_write_memory_lock(vdev);
vfio_pci_dma_buf_move(vdev, true);
if (vdev->pm_runtime_engaged) {
up_write(&vdev->memory_lock);
return -EINVAL;
@ -299,11 +302,9 @@ static int vfio_pci_runtime_pm_entry(struct vfio_pci_core_device *vdev,
return 0;
}
static int vfio_pci_core_pm_entry(struct vfio_device *device, u32 flags,
static int vfio_pci_core_pm_entry(struct vfio_pci_core_device *vdev, u32 flags,
void __user *arg, size_t argsz)
{
struct vfio_pci_core_device *vdev =
container_of(device, struct vfio_pci_core_device, vdev);
int ret;
ret = vfio_check_feature(flags, argsz, VFIO_DEVICE_FEATURE_SET, 0);
@ -320,12 +321,10 @@ static int vfio_pci_core_pm_entry(struct vfio_device *device, u32 flags,
}
static int vfio_pci_core_pm_entry_with_wakeup(
struct vfio_device *device, u32 flags,
struct vfio_pci_core_device *vdev, u32 flags,
struct vfio_device_low_power_entry_with_wakeup __user *arg,
size_t argsz)
{
struct vfio_pci_core_device *vdev =
container_of(device, struct vfio_pci_core_device, vdev);
struct vfio_device_low_power_entry_with_wakeup entry;
struct eventfd_ctx *efdctx;
int ret;
@ -373,14 +372,14 @@ static void vfio_pci_runtime_pm_exit(struct vfio_pci_core_device *vdev)
*/
down_write(&vdev->memory_lock);
__vfio_pci_runtime_pm_exit(vdev);
if (__vfio_pci_memory_enabled(vdev))
vfio_pci_dma_buf_move(vdev, false);
up_write(&vdev->memory_lock);
}
static int vfio_pci_core_pm_exit(struct vfio_device *device, u32 flags,
static int vfio_pci_core_pm_exit(struct vfio_pci_core_device *vdev, u32 flags,
void __user *arg, size_t argsz)
{
struct vfio_pci_core_device *vdev =
container_of(device, struct vfio_pci_core_device, vdev);
int ret;
ret = vfio_check_feature(flags, argsz, VFIO_DEVICE_FEATURE_SET, 0);
@ -695,6 +694,8 @@ void vfio_pci_core_close_device(struct vfio_device *core_vdev)
#endif
vfio_pci_core_disable(vdev);
vfio_pci_dma_buf_cleanup(vdev);
mutex_lock(&vdev->igate);
if (vdev->err_trigger) {
eventfd_ctx_put(vdev->err_trigger);
@ -1205,7 +1206,10 @@ static int vfio_pci_ioctl_reset(struct vfio_pci_core_device *vdev,
*/
vfio_pci_set_power_state(vdev, PCI_D0);
vfio_pci_dma_buf_move(vdev, true);
ret = pci_try_reset_function(vdev->pdev);
if (__vfio_pci_memory_enabled(vdev))
vfio_pci_dma_buf_move(vdev, false);
up_write(&vdev->memory_lock);
return ret;
@ -1449,11 +1453,10 @@ long vfio_pci_core_ioctl(struct vfio_device *core_vdev, unsigned int cmd,
}
EXPORT_SYMBOL_GPL(vfio_pci_core_ioctl);
static int vfio_pci_core_feature_token(struct vfio_device *device, u32 flags,
uuid_t __user *arg, size_t argsz)
static int vfio_pci_core_feature_token(struct vfio_pci_core_device *vdev,
u32 flags, uuid_t __user *arg,
size_t argsz)
{
struct vfio_pci_core_device *vdev =
container_of(device, struct vfio_pci_core_device, vdev);
uuid_t uuid;
int ret;
@ -1480,16 +1483,21 @@ static int vfio_pci_core_feature_token(struct vfio_device *device, u32 flags,
int vfio_pci_core_ioctl_feature(struct vfio_device *device, u32 flags,
void __user *arg, size_t argsz)
{
struct vfio_pci_core_device *vdev =
container_of(device, struct vfio_pci_core_device, vdev);
switch (flags & VFIO_DEVICE_FEATURE_MASK) {
case VFIO_DEVICE_FEATURE_LOW_POWER_ENTRY:
return vfio_pci_core_pm_entry(device, flags, arg, argsz);
return vfio_pci_core_pm_entry(vdev, flags, arg, argsz);
case VFIO_DEVICE_FEATURE_LOW_POWER_ENTRY_WITH_WAKEUP:
return vfio_pci_core_pm_entry_with_wakeup(device, flags,
return vfio_pci_core_pm_entry_with_wakeup(vdev, flags,
arg, argsz);
case VFIO_DEVICE_FEATURE_LOW_POWER_EXIT:
return vfio_pci_core_pm_exit(device, flags, arg, argsz);
return vfio_pci_core_pm_exit(vdev, flags, arg, argsz);
case VFIO_DEVICE_FEATURE_PCI_VF_TOKEN:
return vfio_pci_core_feature_token(device, flags, arg, argsz);
return vfio_pci_core_feature_token(vdev, flags, arg, argsz);
case VFIO_DEVICE_FEATURE_DMA_BUF:
return vfio_pci_core_feature_dma_buf(vdev, flags, arg, argsz);
default:
return -ENOTTY;
}
@ -2061,6 +2069,7 @@ int vfio_pci_core_init_dev(struct vfio_device *core_vdev)
{
struct vfio_pci_core_device *vdev =
container_of(core_vdev, struct vfio_pci_core_device, vdev);
int ret;
vdev->pdev = to_pci_dev(core_vdev->dev);
vdev->irq_type = VFIO_PCI_NUM_IRQS;
@ -2070,6 +2079,10 @@ int vfio_pci_core_init_dev(struct vfio_device *core_vdev)
INIT_LIST_HEAD(&vdev->dummy_resources_list);
INIT_LIST_HEAD(&vdev->ioeventfds_list);
INIT_LIST_HEAD(&vdev->sriov_pfs_item);
ret = pcim_p2pdma_init(vdev->pdev);
if (ret && ret != -EOPNOTSUPP)
return ret;
INIT_LIST_HEAD(&vdev->dmabufs);
init_rwsem(&vdev->memory_lock);
xa_init(&vdev->ctx);
@ -2434,6 +2447,7 @@ static int vfio_pci_dev_set_hot_reset(struct vfio_device_set *dev_set,
break;
}
vfio_pci_dma_buf_move(vdev, true);
vfio_pci_zap_bars(vdev);
}
@ -2462,8 +2476,11 @@ static int vfio_pci_dev_set_hot_reset(struct vfio_device_set *dev_set,
err_undo:
list_for_each_entry_from_reverse(vdev, &dev_set->device_list,
vdev.dev_set_list)
vdev.dev_set_list) {
if (vdev->vdev.open_count && __vfio_pci_memory_enabled(vdev))
vfio_pci_dma_buf_move(vdev, false);
up_write(&vdev->memory_lock);
}
list_for_each_entry(vdev, &dev_set->device_list, vdev.dev_set_list)
pm_runtime_put(&vdev->pdev->dev);


@ -0,0 +1,316 @@
// SPDX-License-Identifier: GPL-2.0-only
/* Copyright (c) 2025, NVIDIA CORPORATION & AFFILIATES.
*/
#include <linux/dma-buf-mapping.h>
#include <linux/pci-p2pdma.h>
#include <linux/dma-resv.h>
#include "vfio_pci_priv.h"
MODULE_IMPORT_NS("DMA_BUF");
struct vfio_pci_dma_buf {
struct dma_buf *dmabuf;
struct vfio_pci_core_device *vdev;
struct list_head dmabufs_elm;
size_t size;
struct dma_buf_phys_vec *phys_vec;
struct p2pdma_provider *provider;
u32 nr_ranges;
u8 revoked : 1;
};
static int vfio_pci_dma_buf_attach(struct dma_buf *dmabuf,
struct dma_buf_attachment *attachment)
{
struct vfio_pci_dma_buf *priv = dmabuf->priv;
if (!attachment->peer2peer)
return -EOPNOTSUPP;
if (priv->revoked)
return -ENODEV;
return 0;
}
static struct sg_table *
vfio_pci_dma_buf_map(struct dma_buf_attachment *attachment,
enum dma_data_direction dir)
{
struct vfio_pci_dma_buf *priv = attachment->dmabuf->priv;
dma_resv_assert_held(priv->dmabuf->resv);
if (priv->revoked)
return ERR_PTR(-ENODEV);
return dma_buf_phys_vec_to_sgt(attachment, priv->provider,
priv->phys_vec, priv->nr_ranges,
priv->size, dir);
}
static void vfio_pci_dma_buf_unmap(struct dma_buf_attachment *attachment,
struct sg_table *sgt,
enum dma_data_direction dir)
{
dma_buf_free_sgt(attachment, sgt, dir);
}
static void vfio_pci_dma_buf_release(struct dma_buf *dmabuf)
{
struct vfio_pci_dma_buf *priv = dmabuf->priv;
/*
* Either this or vfio_pci_dma_buf_cleanup() will remove it from the list.
* The refcount prevents both.
*/
if (priv->vdev) {
down_write(&priv->vdev->memory_lock);
list_del_init(&priv->dmabufs_elm);
up_write(&priv->vdev->memory_lock);
vfio_device_put_registration(&priv->vdev->vdev);
}
kfree(priv->phys_vec);
kfree(priv);
}
static const struct dma_buf_ops vfio_pci_dmabuf_ops = {
.attach = vfio_pci_dma_buf_attach,
.map_dma_buf = vfio_pci_dma_buf_map,
.unmap_dma_buf = vfio_pci_dma_buf_unmap,
.release = vfio_pci_dma_buf_release,
};
int vfio_pci_core_fill_phys_vec(struct dma_buf_phys_vec *phys_vec,
struct vfio_region_dma_range *dma_ranges,
size_t nr_ranges, phys_addr_t start,
phys_addr_t len)
{
phys_addr_t max_addr;
unsigned int i;
max_addr = start + len;
for (i = 0; i < nr_ranges; i++) {
phys_addr_t end;
if (!dma_ranges[i].length)
return -EINVAL;
if (check_add_overflow(start, dma_ranges[i].offset,
&phys_vec[i].paddr) ||
check_add_overflow(phys_vec[i].paddr,
dma_ranges[i].length, &end))
return -EOVERFLOW;
if (end > max_addr)
return -EINVAL;
phys_vec[i].len = dma_ranges[i].length;
}
return 0;
}
EXPORT_SYMBOL_GPL(vfio_pci_core_fill_phys_vec);
int vfio_pci_core_get_dmabuf_phys(struct vfio_pci_core_device *vdev,
struct p2pdma_provider **provider,
unsigned int region_index,
struct dma_buf_phys_vec *phys_vec,
struct vfio_region_dma_range *dma_ranges,
size_t nr_ranges)
{
struct pci_dev *pdev = vdev->pdev;
*provider = pcim_p2pdma_provider(pdev, region_index);
if (!*provider)
return -EINVAL;
return vfio_pci_core_fill_phys_vec(
phys_vec, dma_ranges, nr_ranges,
pci_resource_start(pdev, region_index),
pci_resource_len(pdev, region_index));
}
EXPORT_SYMBOL_GPL(vfio_pci_core_get_dmabuf_phys);
static int validate_dmabuf_input(struct vfio_device_feature_dma_buf *dma_buf,
struct vfio_region_dma_range *dma_ranges,
size_t *lengthp)
{
size_t length = 0;
u32 i;
for (i = 0; i < dma_buf->nr_ranges; i++) {
u64 offset = dma_ranges[i].offset;
u64 len = dma_ranges[i].length;
if (!len || !PAGE_ALIGNED(offset) || !PAGE_ALIGNED(len))
return -EINVAL;
if (check_add_overflow(length, len, &length))
return -EINVAL;
}
/*
* dma_iova_try_alloc() will WARN if userspace proposes a size that
* is too big, e.g. with lots of ranges.
*/
if ((u64)(length) & DMA_IOVA_USE_SWIOTLB)
return -EINVAL;
*lengthp = length;
return 0;
}
int vfio_pci_core_feature_dma_buf(struct vfio_pci_core_device *vdev, u32 flags,
struct vfio_device_feature_dma_buf __user *arg,
size_t argsz)
{
struct vfio_device_feature_dma_buf get_dma_buf = {};
struct vfio_region_dma_range *dma_ranges;
DEFINE_DMA_BUF_EXPORT_INFO(exp_info);
struct vfio_pci_dma_buf *priv;
size_t length;
int ret;
if (!vdev->pci_ops || !vdev->pci_ops->get_dmabuf_phys)
return -EOPNOTSUPP;
ret = vfio_check_feature(flags, argsz, VFIO_DEVICE_FEATURE_GET,
sizeof(get_dma_buf));
if (ret != 1)
return ret;
if (copy_from_user(&get_dma_buf, arg, sizeof(get_dma_buf)))
return -EFAULT;
if (!get_dma_buf.nr_ranges || get_dma_buf.flags)
return -EINVAL;
/*
* For PCI the region_index is the BAR number like everything else.
*/
if (get_dma_buf.region_index >= VFIO_PCI_ROM_REGION_INDEX)
return -ENODEV;
dma_ranges = memdup_array_user(&arg->dma_ranges, get_dma_buf.nr_ranges,
sizeof(*dma_ranges));
if (IS_ERR(dma_ranges))
return PTR_ERR(dma_ranges);
ret = validate_dmabuf_input(&get_dma_buf, dma_ranges, &length);
if (ret)
goto err_free_ranges;
priv = kzalloc(sizeof(*priv), GFP_KERNEL);
if (!priv) {
ret = -ENOMEM;
goto err_free_ranges;
}
priv->phys_vec = kcalloc(get_dma_buf.nr_ranges, sizeof(*priv->phys_vec),
GFP_KERNEL);
if (!priv->phys_vec) {
ret = -ENOMEM;
goto err_free_priv;
}
priv->vdev = vdev;
priv->nr_ranges = get_dma_buf.nr_ranges;
priv->size = length;
ret = vdev->pci_ops->get_dmabuf_phys(vdev, &priv->provider,
get_dma_buf.region_index,
priv->phys_vec, dma_ranges,
priv->nr_ranges);
if (ret)
goto err_free_phys;
kfree(dma_ranges);
dma_ranges = NULL;
if (!vfio_device_try_get_registration(&vdev->vdev)) {
ret = -ENODEV;
goto err_free_phys;
}
exp_info.ops = &vfio_pci_dmabuf_ops;
exp_info.size = priv->size;
exp_info.flags = get_dma_buf.open_flags;
exp_info.priv = priv;
priv->dmabuf = dma_buf_export(&exp_info);
if (IS_ERR(priv->dmabuf)) {
ret = PTR_ERR(priv->dmabuf);
goto err_dev_put;
}
/* dma_buf_put() now frees priv */
INIT_LIST_HEAD(&priv->dmabufs_elm);
down_write(&vdev->memory_lock);
dma_resv_lock(priv->dmabuf->resv, NULL);
priv->revoked = !__vfio_pci_memory_enabled(vdev);
list_add_tail(&priv->dmabufs_elm, &vdev->dmabufs);
dma_resv_unlock(priv->dmabuf->resv);
up_write(&vdev->memory_lock);
/*
* dma_buf_fd() consumes the reference; when the file closes, the dmabuf
* will be released.
*/
ret = dma_buf_fd(priv->dmabuf, get_dma_buf.open_flags);
if (ret < 0)
goto err_dma_buf;
return ret;
err_dma_buf:
dma_buf_put(priv->dmabuf);
err_dev_put:
vfio_device_put_registration(&vdev->vdev);
err_free_phys:
kfree(priv->phys_vec);
err_free_priv:
kfree(priv);
err_free_ranges:
kfree(dma_ranges);
return ret;
}
void vfio_pci_dma_buf_move(struct vfio_pci_core_device *vdev, bool revoked)
{
struct vfio_pci_dma_buf *priv;
struct vfio_pci_dma_buf *tmp;
lockdep_assert_held_write(&vdev->memory_lock);
list_for_each_entry_safe(priv, tmp, &vdev->dmabufs, dmabufs_elm) {
if (!get_file_active(&priv->dmabuf->file))
continue;
if (priv->revoked != revoked) {
dma_resv_lock(priv->dmabuf->resv, NULL);
priv->revoked = revoked;
dma_buf_move_notify(priv->dmabuf);
dma_resv_unlock(priv->dmabuf->resv);
}
fput(priv->dmabuf->file);
}
}
void vfio_pci_dma_buf_cleanup(struct vfio_pci_core_device *vdev)
{
struct vfio_pci_dma_buf *priv;
struct vfio_pci_dma_buf *tmp;
down_write(&vdev->memory_lock);
list_for_each_entry_safe(priv, tmp, &vdev->dmabufs, dmabufs_elm) {
if (!get_file_active(&priv->dmabuf->file))
continue;
dma_resv_lock(priv->dmabuf->resv, NULL);
list_del_init(&priv->dmabufs_elm);
priv->vdev = NULL;
priv->revoked = true;
dma_buf_move_notify(priv->dmabuf);
dma_resv_unlock(priv->dmabuf->resv);
vfio_device_put_registration(&vdev->vdev);
fput(priv->dmabuf->file);
}
up_write(&vdev->memory_lock);
}


@ -107,4 +107,27 @@ static inline bool vfio_pci_is_vga(struct pci_dev *pdev)
return (pdev->class >> 8) == PCI_CLASS_DISPLAY_VGA;
}
#ifdef CONFIG_VFIO_PCI_DMABUF
int vfio_pci_core_feature_dma_buf(struct vfio_pci_core_device *vdev, u32 flags,
struct vfio_device_feature_dma_buf __user *arg,
size_t argsz);
void vfio_pci_dma_buf_cleanup(struct vfio_pci_core_device *vdev);
void vfio_pci_dma_buf_move(struct vfio_pci_core_device *vdev, bool revoked);
#else
static inline int
vfio_pci_core_feature_dma_buf(struct vfio_pci_core_device *vdev, u32 flags,
struct vfio_device_feature_dma_buf __user *arg,
size_t argsz)
{
return -ENOTTY;
}
static inline void vfio_pci_dma_buf_cleanup(struct vfio_pci_core_device *vdev)
{
}
static inline void vfio_pci_dma_buf_move(struct vfio_pci_core_device *vdev,
bool revoked)
{
}
#endif
#endif


@ -172,11 +172,13 @@ void vfio_device_put_registration(struct vfio_device *device)
if (refcount_dec_and_test(&device->refcount))
complete(&device->comp);
}
EXPORT_SYMBOL_GPL(vfio_device_put_registration);
bool vfio_device_try_get_registration(struct vfio_device *device)
{
return refcount_inc_not_zero(&device->refcount);
}
EXPORT_SYMBOL_GPL(vfio_device_try_get_registration);
/*
* VFIO driver API


@ -0,0 +1,17 @@
/* SPDX-License-Identifier: GPL-2.0-only */
/*
* DMA BUF Mapping Helpers
*
*/
#ifndef __DMA_BUF_MAPPING_H__
#define __DMA_BUF_MAPPING_H__
#include <linux/dma-buf.h>
struct sg_table *dma_buf_phys_vec_to_sgt(struct dma_buf_attachment *attach,
struct p2pdma_provider *provider,
struct dma_buf_phys_vec *phys_vec,
size_t nr_ranges, size_t size,
enum dma_data_direction dir);
void dma_buf_free_sgt(struct dma_buf_attachment *attach, struct sg_table *sgt,
enum dma_data_direction dir);
#endif


@ -22,6 +22,7 @@
#include <linux/fs.h>
#include <linux/dma-fence.h>
#include <linux/wait.h>
#include <linux/pci-p2pdma.h>
struct device;
struct dma_buf;
@ -530,6 +531,16 @@ struct dma_buf_export_info {
void *priv;
};
/**
* struct dma_buf_phys_vec - describes a physically contiguous chunk of memory
* @paddr: physical address of the chunk
* @len: length of the chunk
*/
struct dma_buf_phys_vec {
phys_addr_t paddr;
size_t len;
};
/**
* DEFINE_DMA_BUF_EXPORT_INFO - helper macro for exporters
* @name: export-info name


@ -16,7 +16,58 @@
struct block_device;
struct scatterlist;
/**
* struct p2pdma_provider
*
* A p2pdma provider is a range of MMIO address space available to the CPU.
*/
struct p2pdma_provider {
struct device *owner;
u64 bus_offset;
};
enum pci_p2pdma_map_type {
/*
* PCI_P2PDMA_MAP_UNKNOWN: Used internally as an initial state before
* the mapping type has been calculated. Exported routines for the API
* will never return this value.
*/
PCI_P2PDMA_MAP_UNKNOWN = 0,
/*
* Not a PCI P2PDMA transfer.
*/
PCI_P2PDMA_MAP_NONE,
/*
* PCI_P2PDMA_MAP_NOT_SUPPORTED: Indicates the transaction will
* traverse the host bridge and the host bridge is not in the
* allowlist. DMA Mapping routines should return an error when
* this is returned.
*/
PCI_P2PDMA_MAP_NOT_SUPPORTED,
/*
* PCI_P2PDMA_MAP_BUS_ADDR: Indicates that two devices can talk to
* each other directly through a PCI switch and the transaction will
* not traverse the host bridge. Such a mapping should program
* the DMA engine with PCI bus addresses.
*/
PCI_P2PDMA_MAP_BUS_ADDR,
/*
* PCI_P2PDMA_MAP_THRU_HOST_BRIDGE: Indicates two devices can talk
* to each other, but the transaction traverses a host bridge on the
* allowlist. In this case, a normal mapping either with CPU physical
* addresses (in the case of dma-direct) or IOVA addresses (in the
* case of IOMMUs) should be used to program the DMA engine.
*/
PCI_P2PDMA_MAP_THRU_HOST_BRIDGE,
};
#ifdef CONFIG_PCI_P2PDMA
int pcim_p2pdma_init(struct pci_dev *pdev);
struct p2pdma_provider *pcim_p2pdma_provider(struct pci_dev *pdev, int bar);
int pci_p2pdma_add_resource(struct pci_dev *pdev, int bar, size_t size,
u64 offset);
int pci_p2pdma_distance_many(struct pci_dev *provider, struct device **clients,
@ -33,7 +84,18 @@ int pci_p2pdma_enable_store(const char *page, struct pci_dev **p2p_dev,
bool *use_p2pdma);
ssize_t pci_p2pdma_enable_show(char *page, struct pci_dev *p2p_dev,
bool use_p2pdma);
enum pci_p2pdma_map_type pci_p2pdma_map_type(struct p2pdma_provider *provider,
struct device *dev);
#else /* CONFIG_PCI_P2PDMA */
static inline int pcim_p2pdma_init(struct pci_dev *pdev)
{
return -EOPNOTSUPP;
}
static inline struct p2pdma_provider *pcim_p2pdma_provider(struct pci_dev *pdev,
int bar)
{
return NULL;
}
static inline int pci_p2pdma_add_resource(struct pci_dev *pdev, int bar,
size_t size, u64 offset)
{
@ -85,6 +147,11 @@ static inline ssize_t pci_p2pdma_enable_show(char *page,
{
return sprintf(page, "none\n");
}
static inline enum pci_p2pdma_map_type
pci_p2pdma_map_type(struct p2pdma_provider *provider, struct device *dev)
{
return PCI_P2PDMA_MAP_NOT_SUPPORTED;
}
#endif /* CONFIG_PCI_P2PDMA */
@ -99,51 +166,12 @@ static inline struct pci_dev *pci_p2pmem_find(struct device *client)
return pci_p2pmem_find_many(&client, 1);
}
enum pci_p2pdma_map_type {
/*
* PCI_P2PDMA_MAP_UNKNOWN: Used internally as an initial state before
* the mapping type has been calculated. Exported routines for the API
* will never return this value.
*/
PCI_P2PDMA_MAP_UNKNOWN = 0,
/*
* Not a PCI P2PDMA transfer.
*/
PCI_P2PDMA_MAP_NONE,
/*
* PCI_P2PDMA_MAP_NOT_SUPPORTED: Indicates the transaction will
* traverse the host bridge and the host bridge is not in the
* allowlist. DMA Mapping routines should return an error when
* this is returned.
*/
PCI_P2PDMA_MAP_NOT_SUPPORTED,
/*
* PCI_P2PDMA_MAP_BUS_ADDR: Indicates that two devices can talk to
* each other directly through a PCI switch and the transaction will
* not traverse the host bridge. Such a mapping should program
* the DMA engine with PCI bus addresses.
*/
PCI_P2PDMA_MAP_BUS_ADDR,
/*
* PCI_P2PDMA_MAP_THRU_HOST_BRIDGE: Indicates two devices can talk
* to each other, but the transaction traverses a host bridge on the
* allowlist. In this case, a normal mapping either with CPU physical
* addresses (in the case of dma-direct) or IOVA addresses (in the
* case of IOMMUs) should be used to program the DMA engine.
*/
PCI_P2PDMA_MAP_THRU_HOST_BRIDGE,
};
struct pci_p2pdma_map_state {
struct dev_pagemap *pgmap;
struct p2pdma_provider *mem;
enum pci_p2pdma_map_type map;
u64 bus_off;
};
/* helper for pci_p2pdma_state(), do not use directly */
void __pci_p2pdma_update_state(struct pci_p2pdma_map_state *state,
struct device *dev, struct page *page);
@ -162,8 +190,7 @@ pci_p2pdma_state(struct pci_p2pdma_map_state *state, struct device *dev,
struct page *page)
{
if (IS_ENABLED(CONFIG_PCI_P2PDMA) && is_pci_p2pdma_page(page)) {
if (state->pgmap != page_pgmap(page))
__pci_p2pdma_update_state(state, dev, page);
__pci_p2pdma_update_state(state, dev, page);
return state->map;
}
return PCI_P2PDMA_MAP_NONE;
@ -172,16 +199,15 @@ pci_p2pdma_state(struct pci_p2pdma_map_state *state, struct device *dev,
/**
* pci_p2pdma_bus_addr_map - Translate a physical address to a bus address
* for a PCI_P2PDMA_MAP_BUS_ADDR transfer.
* @state: P2P state structure
* @provider: P2P provider structure
* @paddr: physical address to map
*
* Map a physically contiguous PCI_P2PDMA_MAP_BUS_ADDR transfer.
*/
static inline dma_addr_t
pci_p2pdma_bus_addr_map(struct pci_p2pdma_map_state *state, phys_addr_t paddr)
pci_p2pdma_bus_addr_map(struct p2pdma_provider *provider, phys_addr_t paddr)
{
WARN_ON_ONCE(state->map != PCI_P2PDMA_MAP_BUS_ADDR);
return paddr + state->bus_off;
return paddr + provider->bus_offset;
}
#endif /* _LINUX_PCI_P2P_H */


@ -301,6 +301,8 @@ static inline void vfio_put_device(struct vfio_device *device)
int vfio_register_group_dev(struct vfio_device *device);
int vfio_register_emulated_iommu_dev(struct vfio_device *device);
void vfio_unregister_group_dev(struct vfio_device *device);
bool vfio_device_try_get_registration(struct vfio_device *device);
void vfio_device_put_registration(struct vfio_device *device);
int vfio_assign_device_set(struct vfio_device *device, void *set_id);
unsigned int vfio_device_set_open_count(struct vfio_device_set *dev_set);


@ -26,6 +26,8 @@
struct vfio_pci_core_device;
struct vfio_pci_region;
struct p2pdma_provider;
struct dma_buf_phys_vec;
struct vfio_pci_regops {
ssize_t (*rw)(struct vfio_pci_core_device *vdev, char __user *buf,
@ -49,9 +51,48 @@ struct vfio_pci_region {
u32 flags;
};
struct vfio_pci_device_ops {
int (*get_dmabuf_phys)(struct vfio_pci_core_device *vdev,
struct p2pdma_provider **provider,
unsigned int region_index,
struct dma_buf_phys_vec *phys_vec,
struct vfio_region_dma_range *dma_ranges,
size_t nr_ranges);
};
#if IS_ENABLED(CONFIG_VFIO_PCI_DMABUF)
int vfio_pci_core_fill_phys_vec(struct dma_buf_phys_vec *phys_vec,
struct vfio_region_dma_range *dma_ranges,
size_t nr_ranges, phys_addr_t start,
phys_addr_t len);
int vfio_pci_core_get_dmabuf_phys(struct vfio_pci_core_device *vdev,
struct p2pdma_provider **provider,
unsigned int region_index,
struct dma_buf_phys_vec *phys_vec,
struct vfio_region_dma_range *dma_ranges,
size_t nr_ranges);
#else
static inline int
vfio_pci_core_fill_phys_vec(struct dma_buf_phys_vec *phys_vec,
struct vfio_region_dma_range *dma_ranges,
size_t nr_ranges, phys_addr_t start,
phys_addr_t len)
{
return -EINVAL;
}
static inline int vfio_pci_core_get_dmabuf_phys(
struct vfio_pci_core_device *vdev, struct p2pdma_provider **provider,
unsigned int region_index, struct dma_buf_phys_vec *phys_vec,
struct vfio_region_dma_range *dma_ranges, size_t nr_ranges)
{
return -EOPNOTSUPP;
}
#endif
struct vfio_pci_core_device {
struct vfio_device vdev;
struct pci_dev *pdev;
const struct vfio_pci_device_ops *pci_ops;
void __iomem *barmap[PCI_STD_NUM_BARS];
bool bar_mmap_supported[PCI_STD_NUM_BARS];
u8 *pci_config_map;
@ -94,6 +135,7 @@ struct vfio_pci_core_device {
struct vfio_pci_core_device *sriov_pf_core_dev;
struct notifier_block nb;
struct rw_semaphore memory_lock;
struct list_head dmabufs;
};
/* Will be exported for vfio pci drivers usage */


@ -14,6 +14,7 @@
#include <linux/types.h>
#include <linux/ioctl.h>
#include <linux/stddef.h>
#define VFIO_API_VERSION 0
@ -1478,6 +1479,33 @@ struct vfio_device_feature_bus_master {
};
#define VFIO_DEVICE_FEATURE_BUS_MASTER 10
/**
* Upon VFIO_DEVICE_FEATURE_GET, create a dma_buf fd for the
* region selected.
*
* open_flags are the typical flags passed to open(2), eg O_RDWR, O_CLOEXEC,
* etc. offset/length specify a slice of the region to create the dmabuf from.
* nr_ranges is the total number of (P2P DMA) ranges that comprise the dmabuf.
*
* flags should be 0.
*
* Return: The fd number on success, -1 and errno is set on failure.
*/
#define VFIO_DEVICE_FEATURE_DMA_BUF 11
struct vfio_region_dma_range {
__u64 offset;
__u64 length;
};
struct vfio_device_feature_dma_buf {
__u32 region_index;
__u32 open_flags;
__u32 flags;
__u32 nr_ranges;
struct vfio_region_dma_range dma_ranges[] __counted_by(nr_ranges);
};
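From userspace, the feature might be exercised roughly as below; a sketch
assuming the standard struct vfio_device_feature header, with the region index
and slice length chosen arbitrarily:

	size_t sz = sizeof(struct vfio_device_feature) +
		    sizeof(struct vfio_device_feature_dma_buf) +
		    sizeof(struct vfio_region_dma_range);
	struct vfio_device_feature *feat = calloc(1, sz);
	struct vfio_device_feature_dma_buf *get = (void *)feat->data;

	feat->argsz = sz;
	feat->flags = VFIO_DEVICE_FEATURE_GET | VFIO_DEVICE_FEATURE_DMA_BUF;
	get->region_index = 0;			/* BAR 0 */
	get->open_flags = O_RDWR | O_CLOEXEC;
	get->nr_ranges = 1;
	get->dma_ranges[0].offset = 0;
	get->dma_ranges[0].length = 0x10000;	/* 64 KiB slice */

	int dmabuf_fd = ioctl(device_fd, VFIO_DEVICE_FEATURE, feat);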
/* -------- API for Type1 VFIO IOMMU -------- */
/**


@ -479,8 +479,8 @@ int dma_direct_map_sg(struct device *dev, struct scatterlist *sgl, int nents,
}
break;
case PCI_P2PDMA_MAP_BUS_ADDR:
sg->dma_address = pci_p2pdma_bus_addr_map(&p2pdma_state,
sg_phys(sg));
sg->dma_address = pci_p2pdma_bus_addr_map(
p2pdma_state.mem, sg_phys(sg));
sg_dma_mark_bus_address(sg);
continue;
default:


@ -811,7 +811,7 @@ dma_addr_t hmm_dma_map_pfn(struct device *dev, struct hmm_dma_map *map,
break;
case PCI_P2PDMA_MAP_BUS_ADDR:
pfns[idx] |= HMM_PFN_P2PDMA_BUS | HMM_PFN_DMA_MAPPED;
return pci_p2pdma_bus_addr_map(p2pdma_state, paddr);
return pci_p2pdma_bus_addr_map(p2pdma_state->mem, paddr);
default:
return DMA_MAPPING_ERROR;
}