Merge branch 'add-support-for-per-napi-config-via-netlink'

Joe Damato says:

====================
Add support for per-NAPI config via netlink

Greetings:

Welcome to v6. Minor changes from v5 [1], please see changelog below.

There were no explicit comments from reviewers on the call outs in my
v5, so I'm retaining them from my previous cover letter just in case :)

A few important call outs for reviewers:

  1. This revision seems to work (see below for a full walk through). I
     think this is the behavior we talked about, but please let me know if
     a use case is missing.

  2. Re a previous point made by Stanislav regarding "taking over a NAPI
     ID" when the channel count changes: mlx5 seems to call napi_disable
     followed by netif_napi_del for the old queues and then calls
     napi_enable for the new ones. In this RFC, the NAPI ID generation is
     deferred to napi_enable. This means we won't end up with two of the
     same NAPI IDs added to the hash at the same time.

     Can we assume all drivers will napi_disable the old queues before
     napi_enable the new ones?
       - If yes: we might not need to worry about a NAPI ID takeover
       	 function.
       - If no: I'll need to make a change so that the NAPI ID generation
       	 is deferred only for drivers which have opted into the config
       	 space via calls to netif_napi_add_config

  3. I made the decision to remove the WARN_ON_ONCE that (I think?)
     Jakub previously suggested in alloc_netdev_mqs (WARN_ON_ONCE(txqs
     != rxqs);) because this was triggering on every kernel boot with my
     mlx5 NIC.

  4. I left the "maxqs = max(txqs, rxqs);" in alloc_netdev_mqs despite
     thinking this is a bit strange. I think it's strange that we might
     be short some number of NAPI configs, but it seems like most people
     are in favor of this approach, so I've left it.

I'd appreciate thoughts from reviewers on the above items, if at all
possible.

Now, on to the implementation.

Firstly, this implementation moves certain settings to napi_struct so that
they are "per-NAPI", while taking care to respect existing sysfs
parameters which are interface wide and affect all NAPIs:
  - NAPI ID
  - gro_flush_timeout
  - defer_hard_irqs

Furthermore:
   - NAPI ID generation and addition to the hash is now deferred to
     napi_enable, instead of during netif_napi_add
   - NAPIs are removed from the hash during napi_disable, instead of
     netif_napi_del.
   - An array of "struct napi_config" is allocated in net_device.

IMPORTANT: The above changes affect all network drivers.

Optionally, drivers may opt-in to using their config space by calling
netif_napi_add_config instead of netif_napi_add.

If a driver does this, the NAPI being added is linked with an allocated
"struct napi_config" and the per-NAPI settings (including NAPI ID) are
persisted even as hardware queues are destroyed and recreated.

To help illustrate how this would end up working, I've added patches for
3 drivers, of which I have access to only 1:
  - mlx5 which is the basis of the examples below
  - mlx4 which has TX only NAPIs, just to highlight that case. I have
    only compile tested this patch; I don't have this hardware.
  - bnxt which I have only compiled tested. I don't have this
    hardware.

NOTE: I only tested this on mlx5; I have no access to the other hardware
for which I provided patches. Hopefully other folks can help test :)

Here's how it works when I test it on my mlx5 system:

$ ethtool -l eth4 | grep Combined | tail -1
Combined:       2

First, output the current NAPI settings:

$ ./tools/net/ynl/cli.py --spec Documentation/netlink/specs/netdev.yaml \
                         --dump napi-get --json='{"ifindex": 7}'
[{'defer-hard-irqs': 0,
  'gro-flush-timeout': 0,
  'id': 345,
  'ifindex': 7,
  'irq': 527},
 {'defer-hard-irqs': 0,
  'gro-flush-timeout': 0,
  'id': 344,
  'ifindex': 7,
  'irq': 327}]

Now, set the global sysfs parameters:

$ sudo bash -c 'echo 20000 >/sys/class/net/eth4/gro_flush_timeout'
$ sudo bash -c 'echo 100 >/sys/class/net/eth4/napi_defer_hard_irqs'

Output current NAPI settings again:

$ ./tools/net/ynl/cli.py --spec Documentation/netlink/specs/netdev.yaml \
                         --dump napi-get --json='{"ifindex": 7}'
[{'defer-hard-irqs': 100,
  'gro-flush-timeout': 20000,
  'id': 345,
  'ifindex': 7,
  'irq': 527},
 {'defer-hard-irqs': 100,
  'gro-flush-timeout': 20000,
  'id': 344,
  'ifindex': 7,
  'irq': 327}]

Now set NAPI ID 345, via its NAPI ID to specific values:

$ sudo ./tools/net/ynl/cli.py \
          --spec Documentation/netlink/specs/netdev.yaml \
          --do napi-set \
          --json='{"id": 345,
                   "defer-hard-irqs": 111,
                   "gro-flush-timeout": 11111}'
None

Now output current NAPI settings again to ensure only NAPI ID 345
changed:

$ ./tools/net/ynl/cli.py --spec Documentation/netlink/specs/netdev.yaml \
                         --dump napi-get --json='{"ifindex": 7}'

[{'defer-hard-irqs': 111,
  'gro-flush-timeout': 11111,
  'id': 345,
  'ifindex': 7,
  'irq': 527},
 {'defer-hard-irqs': 100,
  'gro-flush-timeout': 20000,
  'id': 344,
  'ifindex': 7,
  'irq': 327}]

Now, increase gro-flush-timeout only:

$ sudo ./tools/net/ynl/cli.py \
       --spec Documentation/netlink/specs/netdev.yaml \
       --do napi-set --json='{"id": 345,
                              "gro-flush-timeout": 44444}'
None

Now output the current NAPI settings once more:

$ ./tools/net/ynl/cli.py --spec Documentation/netlink/specs/netdev.yaml \
                         --dump napi-get --json='{"ifindex": 7}'
[{'defer-hard-irqs': 111,
  'gro-flush-timeout': 44444,
  'id': 345,
  'ifindex': 7,
  'irq': 527},
 {'defer-hard-irqs': 100,
  'gro-flush-timeout': 20000,
  'id': 344,
  'ifindex': 7,
  'irq': 327}]

Now set NAPI ID 345 to have gro_flush_timeout of 0:

$ sudo ./tools/net/ynl/cli.py \
       --spec Documentation/netlink/specs/netdev.yaml \
       --do napi-set --json='{"id": 345,
                              "gro-flush-timeout": 0}'
None

Check that NAPI ID 345 has a value of 0:

$ ./tools/net/ynl/cli.py --spec Documentation/netlink/specs/netdev.yaml \
                         --dump napi-get --json='{"ifindex": 7}'

[{'defer-hard-irqs': 111,
  'gro-flush-timeout': 0,
  'id': 345,
  'ifindex': 7,
  'irq': 527},
 {'defer-hard-irqs': 100,
  'gro-flush-timeout': 20000,
  'id': 344,
  'ifindex': 7,
  'irq': 327}]

Change the queue count, ensuring that NAPI ID 345 retains its settings:

$ sudo ethtool -L eth4 combined 4

Check that the new queues have the system wide settings but that NAPI ID
345 remains unchanged:

$ ./tools/net/ynl/cli.py --spec Documentation/netlink/specs/netdev.yaml \
                         --dump napi-get --json='{"ifindex": 7}'

[{'defer-hard-irqs': 100,
  'gro-flush-timeout': 20000,
  'id': 347,
  'ifindex': 7,
  'irq': 529},
 {'defer-hard-irqs': 100,
  'gro-flush-timeout': 20000,
  'id': 346,
  'ifindex': 7,
  'irq': 528},
 {'defer-hard-irqs': 111,
  'gro-flush-timeout': 0,
  'id': 345,
  'ifindex': 7,
  'irq': 527},
 {'defer-hard-irqs': 100,
  'gro-flush-timeout': 20000,
  'id': 344,
  'ifindex': 7,
  'irq': 327}]

Now reduce the queue count below where NAPI ID 345 is indexed:

$ sudo ethtool -L eth4 combined 1

Check the output:

$ ./tools/net/ynl/cli.py --spec Documentation/netlink/specs/netdev.yaml \
                         --dump napi-get --json='{"ifindex": 7}'
[{'defer-hard-irqs': 100,
  'gro-flush-timeout': 20000,
  'id': 344,
  'ifindex': 7,
  'irq': 327}]

Re-increase the queue count to ensure NAPI ID 345 is re-assigned the same
values:

$ sudo ethtool -L eth4 combined 2

$ ./tools/net/ynl/cli.py --spec Documentation/netlink/specs/netdev.yaml \
                         --dump napi-get --json='{"ifindex": 7}'
[{'defer-hard-irqs': 111,
  'gro-flush-timeout': 0,
  'id': 345,
  'ifindex': 7,
  'irq': 527},
 {'defer-hard-irqs': 100,
  'gro-flush-timeout': 20000,
  'id': 344,
  'ifindex': 7,
  'irq': 327}]

Create new queues to ensure the sysfs globals are used for the new NAPIs
but that NAPI ID 345 is unchanged:

$ sudo ethtool -L eth4 combined 8

$ ./tools/net/ynl/cli.py --spec Documentation/netlink/specs/netdev.yaml \
                         --dump napi-get --json='{"ifindex": 7}'
[...]
 {'defer-hard-irqs': 100,
  'gro-flush-timeout': 20000,
  'id': 346,
  'ifindex': 7,
  'irq': 528},
 {'defer-hard-irqs': 111,
  'gro-flush-timeout': 0,
  'id': 345,
  'ifindex': 7,
  'irq': 527},
 {'defer-hard-irqs': 100,
  'gro-flush-timeout': 20000,
  'id': 344,
  'ifindex': 7,
  'irq': 327}]

Last, but not least, let's try writing the sysfs parameters to ensure
all NAPIs are rewritten:

$ sudo bash -c 'echo 33333 >/sys/class/net/eth4/gro_flush_timeout'
$ sudo bash -c 'echo 222 >/sys/class/net/eth4/napi_defer_hard_irqs'

Check that worked:

$ ./tools/net/ynl/cli.py --spec Documentation/netlink/specs/netdev.yaml \
                         --dump napi-get --json='{"ifindex": 7}'

[...]
 {'defer-hard-irqs': 222,
  'gro-flush-timeout': 33333,
  'id': 346,
  'ifindex': 7,
  'irq': 528},
 {'defer-hard-irqs': 222,
  'gro-flush-timeout': 33333,
  'id': 345,
  'ifindex': 7,
  'irq': 527},
 {'defer-hard-irqs': 222,
  'gro-flush-timeout': 33333,
  'id': 344,
  'ifindex': 7,
  'irq': 327}]

[1]: https://lore.kernel.org/20241009005525.13651-1-jdamato@fastly.com

v5: https://lore.kernel.org/20241009005525.13651-1-jdamato@fastly.com
rfcv4: https://lore.kernel.org/lkml/20241001235302.57609-1-jdamato@fastly.com
rfcv3: https://lore.kernel.org/20240912100738.16567-8-jdamato@fastly.com
rfcv2: https://lore.kernel.org/20240908160702.56618-1-jdamato@fastly.com
====================

Link: https://patch.msgid.link/20241011184527.16393-1-jdamato@fastly.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
This commit is contained in:
Jakub Kicinski 2024-10-14 17:55:08 -07:00
commit 5bedbfc165
14 changed files with 326 additions and 25 deletions

View File

@ -248,6 +248,21 @@ attribute-sets:
threaded mode. If NAPI is not in threaded mode (i.e. uses normal
softirq context), the attribute will be absent.
type: u32
-
name: defer-hard-irqs
doc: The number of consecutive empty polls before IRQ deferral ends
and hardware IRQs are re-enabled.
type: u32
checks:
max: s32-max
-
name: gro-flush-timeout
doc: The timeout, in nanoseconds, of when to trigger the NAPI watchdog
timer which schedules NAPI processing. Additionally, a non-zero
value will also prevent GRO from flushing recent super-frames at
the end of a NAPI cycle. This may add receive latency in exchange
for reducing the number of frames processed by the network stack.
type: uint
-
name: queue
attributes:
@ -636,6 +651,8 @@ operations:
- ifindex
- irq
- pid
- defer-hard-irqs
- gro-flush-timeout
dump:
request:
attributes:
@ -676,6 +693,17 @@ operations:
reply:
attributes:
- id
-
name: napi-set
doc: Set configurable NAPI instance settings.
attribute-set: napi
flags: [ admin-perm ]
do:
request:
attributes:
- id
- defer-hard-irqs
- gro-flush-timeout
kernel-family:
headers: [ "linux/list.h"]

View File

@ -186,4 +186,7 @@ struct dpll_pin* dpll_pin
struct hlist_head page_pools
struct dim_irq_moder* irq_moder
u64 max_pacing_offload_horizon
struct_napi_config* napi_config
unsigned_long gro_flush_timeout
u32 napi_defer_hard_irqs
=================================== =========================== =================== =================== ===================================================================================

View File

@ -10986,7 +10986,8 @@ static void bnxt_init_napi(struct bnxt *bp)
cp_nr_rings--;
for (i = 0; i < cp_nr_rings; i++) {
bnapi = bp->bnapi[i];
netif_napi_add(bp->dev, &bnapi->napi, poll_fn);
netif_napi_add_config(bp->dev, &bnapi->napi, poll_fn,
bnapi->index);
}
if (BNXT_CHIP_TYPE_NITRO_A0(bp)) {
bnapi = bp->bnapi[cp_nr_rings];

View File

@ -156,7 +156,8 @@ int mlx4_en_activate_cq(struct mlx4_en_priv *priv, struct mlx4_en_cq *cq,
break;
case RX:
cq->mcq.comp = mlx4_en_rx_irq;
netif_napi_add(cq->dev, &cq->napi, mlx4_en_poll_rx_cq);
netif_napi_add_config(cq->dev, &cq->napi, mlx4_en_poll_rx_cq,
cq_idx);
netif_napi_set_irq(&cq->napi, irq);
napi_enable(&cq->napi);
netif_queue_set_napi(cq->dev, cq_idx, NETDEV_QUEUE_TYPE_RX, &cq->napi);

View File

@ -2697,7 +2697,7 @@ static int mlx5e_open_channel(struct mlx5e_priv *priv, int ix,
c->aff_mask = irq_get_effective_affinity_mask(irq);
c->lag_port = mlx5e_enumerate_lag_port(mdev, ix);
netif_napi_add(netdev, &c->napi, mlx5e_napi_poll);
netif_napi_add_config(netdev, &c->napi, mlx5e_napi_poll, ix);
netif_napi_set_irq(&c->napi, irq);
err = mlx5e_open_queues(c, params, cparam);

View File

@ -342,6 +342,15 @@ struct gro_list {
*/
#define GRO_HASH_BUCKETS 8
/*
* Structure for per-NAPI config
*/
struct napi_config {
u64 gro_flush_timeout;
u32 defer_hard_irqs;
unsigned int napi_id;
};
/*
* Structure for NAPI scheduling similar to tasklet but with weighting
*/
@ -373,10 +382,14 @@ struct napi_struct {
unsigned int napi_id;
struct hrtimer timer;
struct task_struct *thread;
unsigned long gro_flush_timeout;
u32 defer_hard_irqs;
/* control-path-only fields follow */
struct list_head dev_list;
struct hlist_node napi_hash_node;
int irq;
int index;
struct napi_config *config;
};
enum {
@ -1866,9 +1879,6 @@ enum netdev_reg_state {
* allocated at register_netdev() time
* @real_num_rx_queues: Number of RX queues currently active in device
* @xdp_prog: XDP sockets filter program pointer
* @gro_flush_timeout: timeout for GRO layer in NAPI
* @napi_defer_hard_irqs: If not zero, provides a counter that would
* allow to avoid NIC hard IRQ, on busy queues.
*
* @rx_handler: handler for received packets
* @rx_handler_data: XXX: need comments on this one
@ -2018,6 +2028,11 @@ enum netdev_reg_state {
* where the clock is recovered.
*
* @max_pacing_offload_horizon: max EDT offload horizon in nsec.
* @napi_config: An array of napi_config structures containing per-NAPI
* settings.
* @gro_flush_timeout: timeout for GRO layer in NAPI
* @napi_defer_hard_irqs: If not zero, provides a counter that would
* allow to avoid NIC hard IRQ, on busy queues.
*
* FIXME: cleanup struct net_device such that network protocol info
* moves out.
@ -2084,8 +2099,6 @@ struct net_device {
int ifindex;
unsigned int real_num_rx_queues;
struct netdev_rx_queue *_rx;
unsigned long gro_flush_timeout;
u32 napi_defer_hard_irqs;
unsigned int gro_max_size;
unsigned int gro_ipv4_max_size;
rx_handler_func_t __rcu *rx_handler;
@ -2413,6 +2426,9 @@ struct net_device {
struct dim_irq_moder *irq_moder;
u64 max_pacing_offload_horizon;
struct napi_config *napi_config;
unsigned long gro_flush_timeout;
u32 napi_defer_hard_irqs;
/**
* @lock: protects @net_shaper_hierarchy, feel free to use for other
@ -2676,6 +2692,22 @@ netif_napi_add_tx_weight(struct net_device *dev,
netif_napi_add_weight(dev, napi, poll, weight);
}
/**
* netif_napi_add_config - initialize a NAPI context with persistent config
* @dev: network device
* @napi: NAPI context
* @poll: polling function
* @index: the NAPI index
*/
static inline void
netif_napi_add_config(struct net_device *dev, struct napi_struct *napi,
int (*poll)(struct napi_struct *, int), int index)
{
napi->index = index;
napi->config = &dev->napi_config[index];
netif_napi_add_weight(dev, napi, poll, NAPI_POLL_WEIGHT);
}
/**
* netif_napi_add_tx() - initialize a NAPI context to be used for Tx only
* @dev: network device

View File

@ -122,6 +122,8 @@ enum {
NETDEV_A_NAPI_ID,
NETDEV_A_NAPI_IRQ,
NETDEV_A_NAPI_PID,
NETDEV_A_NAPI_DEFER_HARD_IRQS,
NETDEV_A_NAPI_GRO_FLUSH_TIMEOUT,
__NETDEV_A_NAPI_MAX,
NETDEV_A_NAPI_MAX = (__NETDEV_A_NAPI_MAX - 1)
@ -199,6 +201,7 @@ enum {
NETDEV_CMD_NAPI_GET,
NETDEV_CMD_QSTATS_GET,
NETDEV_CMD_BIND_RX,
NETDEV_CMD_NAPI_SET,
__NETDEV_CMD_MAX,
NETDEV_CMD_MAX = (__NETDEV_CMD_MAX - 1)

View File

@ -6232,12 +6232,12 @@ bool napi_complete_done(struct napi_struct *n, int work_done)
if (work_done) {
if (n->gro_bitmask)
timeout = READ_ONCE(n->dev->gro_flush_timeout);
n->defer_hard_irqs_count = READ_ONCE(n->dev->napi_defer_hard_irqs);
timeout = napi_get_gro_flush_timeout(n);
n->defer_hard_irqs_count = napi_get_defer_hard_irqs(n);
}
if (n->defer_hard_irqs_count > 0) {
n->defer_hard_irqs_count--;
timeout = READ_ONCE(n->dev->gro_flush_timeout);
timeout = napi_get_gro_flush_timeout(n);
if (timeout)
ret = false;
}
@ -6371,8 +6371,8 @@ static void busy_poll_stop(struct napi_struct *napi, void *have_poll_lock,
bpf_net_ctx = bpf_net_ctx_set(&__bpf_net_ctx);
if (flags & NAPI_F_PREFER_BUSY_POLL) {
napi->defer_hard_irqs_count = READ_ONCE(napi->dev->napi_defer_hard_irqs);
timeout = READ_ONCE(napi->dev->gro_flush_timeout);
napi->defer_hard_irqs_count = napi_get_defer_hard_irqs(napi);
timeout = napi_get_gro_flush_timeout(napi);
if (napi->defer_hard_irqs_count && timeout) {
hrtimer_start(&napi->timer, ns_to_ktime(timeout), HRTIMER_MODE_REL_PINNED);
skip_schedule = true;
@ -6505,6 +6505,23 @@ EXPORT_SYMBOL(napi_busy_loop);
#endif /* CONFIG_NET_RX_BUSY_POLL */
static void __napi_hash_add_with_id(struct napi_struct *napi,
unsigned int napi_id)
{
napi->napi_id = napi_id;
hlist_add_head_rcu(&napi->napi_hash_node,
&napi_hash[napi->napi_id % HASH_SIZE(napi_hash)]);
}
static void napi_hash_add_with_id(struct napi_struct *napi,
unsigned int napi_id)
{
spin_lock(&napi_hash_lock);
WARN_ON_ONCE(napi_by_id(napi_id));
__napi_hash_add_with_id(napi, napi_id);
spin_unlock(&napi_hash_lock);
}
static void napi_hash_add(struct napi_struct *napi)
{
if (test_bit(NAPI_STATE_NO_BUSY_POLL, &napi->state))
@ -6517,10 +6534,8 @@ static void napi_hash_add(struct napi_struct *napi)
if (unlikely(++napi_gen_id < MIN_NAPI_ID))
napi_gen_id = MIN_NAPI_ID;
} while (napi_by_id(napi_gen_id));
napi->napi_id = napi_gen_id;
hlist_add_head_rcu(&napi->napi_hash_node,
&napi_hash[napi->napi_id % HASH_SIZE(napi_hash)]);
__napi_hash_add_with_id(napi, napi_gen_id);
spin_unlock(&napi_hash_lock);
}
@ -6643,6 +6658,28 @@ void netif_queue_set_napi(struct net_device *dev, unsigned int queue_index,
}
EXPORT_SYMBOL(netif_queue_set_napi);
static void napi_restore_config(struct napi_struct *n)
{
n->defer_hard_irqs = n->config->defer_hard_irqs;
n->gro_flush_timeout = n->config->gro_flush_timeout;
/* a NAPI ID might be stored in the config, if so use it. if not, use
* napi_hash_add to generate one for us. It will be saved to the config
* in napi_disable.
*/
if (n->config->napi_id)
napi_hash_add_with_id(n, n->config->napi_id);
else
napi_hash_add(n);
}
static void napi_save_config(struct napi_struct *n)
{
n->config->defer_hard_irqs = n->defer_hard_irqs;
n->config->gro_flush_timeout = n->gro_flush_timeout;
n->config->napi_id = n->napi_id;
napi_hash_del(n);
}
void netif_napi_add_weight(struct net_device *dev, struct napi_struct *napi,
int (*poll)(struct napi_struct *, int), int weight)
{
@ -6670,7 +6707,13 @@ void netif_napi_add_weight(struct net_device *dev, struct napi_struct *napi,
set_bit(NAPI_STATE_SCHED, &napi->state);
set_bit(NAPI_STATE_NPSVC, &napi->state);
list_add_rcu(&napi->dev_list, &dev->napi_list);
napi_hash_add(napi);
/* default settings from sysfs are applied to all NAPIs. any per-NAPI
* configuration will be loaded in napi_enable
*/
napi_set_defer_hard_irqs(napi, READ_ONCE(dev->napi_defer_hard_irqs));
napi_set_gro_flush_timeout(napi, READ_ONCE(dev->gro_flush_timeout));
napi_get_frags_check(napi);
/* Create kthread for this napi if dev->threaded is set.
* Clear dev->threaded if kthread creation failed so that
@ -6702,6 +6745,11 @@ void napi_disable(struct napi_struct *n)
hrtimer_cancel(&n->timer);
if (n->config)
napi_save_config(n);
else
napi_hash_del(n);
clear_bit(NAPI_STATE_DISABLE, &n->state);
}
EXPORT_SYMBOL(napi_disable);
@ -6717,6 +6765,11 @@ void napi_enable(struct napi_struct *n)
{
unsigned long new, val = READ_ONCE(n->state);
if (n->config)
napi_restore_config(n);
else
napi_hash_add(n);
do {
BUG_ON(!test_bit(NAPI_STATE_SCHED, &val));
@ -6746,7 +6799,11 @@ void __netif_napi_del(struct napi_struct *napi)
if (!test_and_clear_bit(NAPI_STATE_LISTED, &napi->state))
return;
napi_hash_del(napi);
if (napi->config) {
napi->index = -1;
napi->config = NULL;
}
list_del_rcu(&napi->dev_list);
napi_free_frags(napi);
@ -11058,8 +11115,8 @@ void netdev_sw_irq_coalesce_default_on(struct net_device *dev)
WARN_ON(dev->reg_state == NETREG_REGISTERED);
if (!IS_ENABLED(CONFIG_PREEMPT_RT)) {
dev->gro_flush_timeout = 20000;
dev->napi_defer_hard_irqs = 1;
netdev_set_gro_flush_timeout(dev, 20000);
netdev_set_defer_hard_irqs(dev, 1);
}
}
EXPORT_SYMBOL_GPL(netdev_sw_irq_coalesce_default_on);
@ -11083,6 +11140,8 @@ struct net_device *alloc_netdev_mqs(int sizeof_priv, const char *name,
unsigned int txqs, unsigned int rxqs)
{
struct net_device *dev;
size_t napi_config_sz;
unsigned int maxqs;
BUG_ON(strlen(name) >= sizeof(dev->name));
@ -11096,6 +11155,8 @@ struct net_device *alloc_netdev_mqs(int sizeof_priv, const char *name,
return NULL;
}
maxqs = max(txqs, rxqs);
dev = kvzalloc(struct_size(dev, priv, sizeof_priv),
GFP_KERNEL_ACCOUNT | __GFP_RETRY_MAYFAIL);
if (!dev)
@ -11172,6 +11233,11 @@ struct net_device *alloc_netdev_mqs(int sizeof_priv, const char *name,
if (!dev->ethtool)
goto free_all;
napi_config_sz = array_size(maxqs, sizeof(*dev->napi_config));
dev->napi_config = kvzalloc(napi_config_sz, GFP_KERNEL_ACCOUNT);
if (!dev->napi_config)
goto free_all;
strscpy(dev->name, name);
dev->name_assign_type = name_assign_type;
dev->group = INIT_NETDEV_GROUP;
@ -11235,6 +11301,8 @@ void free_netdev(struct net_device *dev)
list_for_each_entry_safe(p, n, &dev->napi_list, dev_list)
netif_napi_del(p);
kvfree(dev->napi_config);
ref_tracker_dir_exit(&dev->refcnt_tracker);
#ifdef CONFIG_PCPU_DEV_REFCNT
free_percpu(dev->pcpu_refcnt);
@ -12002,8 +12070,6 @@ static void __init net_dev_struct_check(void)
CACHELINE_ASSERT_GROUP_MEMBER(struct net_device, net_device_read_rx, ifindex);
CACHELINE_ASSERT_GROUP_MEMBER(struct net_device, net_device_read_rx, real_num_rx_queues);
CACHELINE_ASSERT_GROUP_MEMBER(struct net_device, net_device_read_rx, _rx);
CACHELINE_ASSERT_GROUP_MEMBER(struct net_device, net_device_read_rx, gro_flush_timeout);
CACHELINE_ASSERT_GROUP_MEMBER(struct net_device, net_device_read_rx, napi_defer_hard_irqs);
CACHELINE_ASSERT_GROUP_MEMBER(struct net_device, net_device_read_rx, gro_max_size);
CACHELINE_ASSERT_GROUP_MEMBER(struct net_device, net_device_read_rx, gro_ipv4_max_size);
CACHELINE_ASSERT_GROUP_MEMBER(struct net_device, net_device_read_rx, rx_handler);
@ -12015,7 +12081,7 @@ static void __init net_dev_struct_check(void)
#ifdef CONFIG_NET_XGRESS
CACHELINE_ASSERT_GROUP_MEMBER(struct net_device, net_device_read_rx, tcx_ingress);
#endif
CACHELINE_ASSERT_GROUP_SIZE(struct net_device, net_device_read_rx, 104);
CACHELINE_ASSERT_GROUP_SIZE(struct net_device, net_device_read_rx, 92);
}
/*

View File

@ -148,6 +148,94 @@ static inline void netif_set_gro_ipv4_max_size(struct net_device *dev,
WRITE_ONCE(dev->gro_ipv4_max_size, size);
}
/**
* napi_get_defer_hard_irqs - get the NAPI's defer_hard_irqs
* @n: napi struct to get the defer_hard_irqs field from
*
* Return: the per-NAPI value of the defar_hard_irqs field.
*/
static inline u32 napi_get_defer_hard_irqs(const struct napi_struct *n)
{
return READ_ONCE(n->defer_hard_irqs);
}
/**
* napi_set_defer_hard_irqs - set the defer_hard_irqs for a napi
* @n: napi_struct to set the defer_hard_irqs field
* @defer: the value the field should be set to
*/
static inline void napi_set_defer_hard_irqs(struct napi_struct *n, u32 defer)
{
WRITE_ONCE(n->defer_hard_irqs, defer);
}
/**
* netdev_set_defer_hard_irqs - set defer_hard_irqs for all NAPIs of a netdev
* @netdev: the net_device for which all NAPIs will have defer_hard_irqs set
* @defer: the defer_hard_irqs value to set
*/
static inline void netdev_set_defer_hard_irqs(struct net_device *netdev,
u32 defer)
{
unsigned int count = max(netdev->num_rx_queues,
netdev->num_tx_queues);
struct napi_struct *napi;
int i;
WRITE_ONCE(netdev->napi_defer_hard_irqs, defer);
list_for_each_entry(napi, &netdev->napi_list, dev_list)
napi_set_defer_hard_irqs(napi, defer);
for (i = 0; i < count; i++)
netdev->napi_config[i].defer_hard_irqs = defer;
}
/**
* napi_get_gro_flush_timeout - get the gro_flush_timeout
* @n: napi struct to get the gro_flush_timeout from
*
* Return: the per-NAPI value of the gro_flush_timeout field.
*/
static inline unsigned long
napi_get_gro_flush_timeout(const struct napi_struct *n)
{
return READ_ONCE(n->gro_flush_timeout);
}
/**
* napi_set_gro_flush_timeout - set the gro_flush_timeout for a napi
* @n: napi struct to set the gro_flush_timeout
* @timeout: timeout value to set
*
* napi_set_gro_flush_timeout sets the per-NAPI gro_flush_timeout
*/
static inline void napi_set_gro_flush_timeout(struct napi_struct *n,
unsigned long timeout)
{
WRITE_ONCE(n->gro_flush_timeout, timeout);
}
/**
* netdev_set_gro_flush_timeout - set gro_flush_timeout of a netdev's NAPIs
* @netdev: the net_device for which all NAPIs will have gro_flush_timeout set
* @timeout: the timeout value to set
*/
static inline void netdev_set_gro_flush_timeout(struct net_device *netdev,
unsigned long timeout)
{
unsigned int count = max(netdev->num_rx_queues,
netdev->num_tx_queues);
struct napi_struct *napi;
int i;
WRITE_ONCE(netdev->gro_flush_timeout, timeout);
list_for_each_entry(napi, &netdev->napi_list, dev_list)
napi_set_gro_flush_timeout(napi, timeout);
for (i = 0; i < count; i++)
netdev->napi_config[i].gro_flush_timeout = timeout;
}
int rps_cpumask_housekeeping(struct cpumask *mask);
#if defined(CONFIG_DEBUG_NET) && defined(CONFIG_BPF_SYSCALL)

View File

@ -409,7 +409,7 @@ NETDEVICE_SHOW_RW(tx_queue_len, fmt_dec);
static int change_gro_flush_timeout(struct net_device *dev, unsigned long val)
{
WRITE_ONCE(dev->gro_flush_timeout, val);
netdev_set_gro_flush_timeout(dev, val);
return 0;
}
@ -429,7 +429,7 @@ static int change_napi_defer_hard_irqs(struct net_device *dev, unsigned long val
if (val > S32_MAX)
return -ERANGE;
WRITE_ONCE(dev->napi_defer_hard_irqs, val);
netdev_set_defer_hard_irqs(dev, (u32)val);
return 0;
}

View File

@ -22,6 +22,10 @@ static const struct netlink_range_validation netdev_a_page_pool_ifindex_range =
.max = 2147483647ULL,
};
static const struct netlink_range_validation netdev_a_napi_defer_hard_irqs_range = {
.max = 2147483647ULL,
};
/* Common nested types */
const struct nla_policy netdev_page_pool_info_nl_policy[NETDEV_A_PAGE_POOL_IFINDEX + 1] = {
[NETDEV_A_PAGE_POOL_ID] = NLA_POLICY_FULL_RANGE(NLA_UINT, &netdev_a_page_pool_id_range),
@ -87,6 +91,13 @@ static const struct nla_policy netdev_bind_rx_nl_policy[NETDEV_A_DMABUF_FD + 1]
[NETDEV_A_DMABUF_QUEUES] = NLA_POLICY_NESTED(netdev_queue_id_nl_policy),
};
/* NETDEV_CMD_NAPI_SET - do */
static const struct nla_policy netdev_napi_set_nl_policy[NETDEV_A_NAPI_GRO_FLUSH_TIMEOUT + 1] = {
[NETDEV_A_NAPI_ID] = { .type = NLA_U32, },
[NETDEV_A_NAPI_DEFER_HARD_IRQS] = NLA_POLICY_FULL_RANGE(NLA_U32, &netdev_a_napi_defer_hard_irqs_range),
[NETDEV_A_NAPI_GRO_FLUSH_TIMEOUT] = { .type = NLA_UINT, },
};
/* Ops table for netdev */
static const struct genl_split_ops netdev_nl_ops[] = {
{
@ -171,6 +182,13 @@ static const struct genl_split_ops netdev_nl_ops[] = {
.maxattr = NETDEV_A_DMABUF_FD,
.flags = GENL_ADMIN_PERM | GENL_CMD_CAP_DO,
},
{
.cmd = NETDEV_CMD_NAPI_SET,
.doit = netdev_nl_napi_set_doit,
.policy = netdev_napi_set_nl_policy,
.maxattr = NETDEV_A_NAPI_GRO_FLUSH_TIMEOUT,
.flags = GENL_ADMIN_PERM | GENL_CMD_CAP_DO,
},
};
static const struct genl_multicast_group netdev_nl_mcgrps[] = {

View File

@ -33,6 +33,7 @@ int netdev_nl_napi_get_dumpit(struct sk_buff *skb, struct netlink_callback *cb);
int netdev_nl_qstats_get_dumpit(struct sk_buff *skb,
struct netlink_callback *cb);
int netdev_nl_bind_rx_doit(struct sk_buff *skb, struct genl_info *info);
int netdev_nl_napi_set_doit(struct sk_buff *skb, struct genl_info *info);
enum {
NETDEV_NLGRP_MGMT,

View File

@ -161,6 +161,8 @@ static int
netdev_nl_napi_fill_one(struct sk_buff *rsp, struct napi_struct *napi,
const struct genl_info *info)
{
unsigned long gro_flush_timeout;
u32 napi_defer_hard_irqs;
void *hdr;
pid_t pid;
@ -189,6 +191,16 @@ netdev_nl_napi_fill_one(struct sk_buff *rsp, struct napi_struct *napi,
goto nla_put_failure;
}
napi_defer_hard_irqs = napi_get_defer_hard_irqs(napi);
if (nla_put_s32(rsp, NETDEV_A_NAPI_DEFER_HARD_IRQS,
napi_defer_hard_irqs))
goto nla_put_failure;
gro_flush_timeout = napi_get_gro_flush_timeout(napi);
if (nla_put_uint(rsp, NETDEV_A_NAPI_GRO_FLUSH_TIMEOUT,
gro_flush_timeout))
goto nla_put_failure;
genlmsg_end(rsp, hdr);
return 0;
@ -291,6 +303,51 @@ int netdev_nl_napi_get_dumpit(struct sk_buff *skb, struct netlink_callback *cb)
return err;
}
static int
netdev_nl_napi_set_config(struct napi_struct *napi, struct genl_info *info)
{
u64 gro_flush_timeout = 0;
u32 defer = 0;
if (info->attrs[NETDEV_A_NAPI_DEFER_HARD_IRQS]) {
defer = nla_get_u32(info->attrs[NETDEV_A_NAPI_DEFER_HARD_IRQS]);
napi_set_defer_hard_irqs(napi, defer);
}
if (info->attrs[NETDEV_A_NAPI_GRO_FLUSH_TIMEOUT]) {
gro_flush_timeout = nla_get_uint(info->attrs[NETDEV_A_NAPI_GRO_FLUSH_TIMEOUT]);
napi_set_gro_flush_timeout(napi, gro_flush_timeout);
}
return 0;
}
int netdev_nl_napi_set_doit(struct sk_buff *skb, struct genl_info *info)
{
struct napi_struct *napi;
unsigned int napi_id;
int err;
if (GENL_REQ_ATTR_CHECK(info, NETDEV_A_NAPI_ID))
return -EINVAL;
napi_id = nla_get_u32(info->attrs[NETDEV_A_NAPI_ID]);
rtnl_lock();
napi = napi_by_id(napi_id);
if (napi) {
err = netdev_nl_napi_set_config(napi, info);
} else {
NL_SET_BAD_ATTR(info->extack, info->attrs[NETDEV_A_NAPI_ID]);
err = -ENOENT;
}
rtnl_unlock();
return err;
}
static int
netdev_nl_queue_fill_one(struct sk_buff *rsp, struct net_device *netdev,
u32 q_idx, u32 q_type, const struct genl_info *info)

View File

@ -122,6 +122,8 @@ enum {
NETDEV_A_NAPI_ID,
NETDEV_A_NAPI_IRQ,
NETDEV_A_NAPI_PID,
NETDEV_A_NAPI_DEFER_HARD_IRQS,
NETDEV_A_NAPI_GRO_FLUSH_TIMEOUT,
__NETDEV_A_NAPI_MAX,
NETDEV_A_NAPI_MAX = (__NETDEV_A_NAPI_MAX - 1)
@ -199,6 +201,7 @@ enum {
NETDEV_CMD_NAPI_GET,
NETDEV_CMD_QSTATS_GET,
NETDEV_CMD_BIND_RX,
NETDEV_CMD_NAPI_SET,
__NETDEV_CMD_MAX,
NETDEV_CMD_MAX = (__NETDEV_CMD_MAX - 1)