Commit Graph

9796 Commits

Author SHA1 Message Date
Jacob Keller c7d6cb96d5 igc: fix lock order in igc_ptp_reset
Commit 1a931c4f5e ("igc: add lock preventing multiple simultaneous PTM
transactions") added a new mutex to protect concurrent PTM transactions.
This lock is acquired in igc_ptp_reset() in order to ensure the PTM
registers are properly disabled after a device reset.

The flow where the lock is acquired already holds a spinlock, so acquiring
a mutex leads to a sleep-while-locking bug, reported both by smatch,
and the kernel test robot.

The critical section in igc_ptp_reset() does correctly use the
readx_poll_timeout_atomic variants, but the standard PTM flow uses regular
sleeping variants. This makes converting the mutex to a spinlock a bit
tricky.

Instead, re-order the locking in igc_ptp_reset. Acquire the mutex first,
and then the tmreg_lock spinlock. This is safe because there is no other
ordering dependency on these locks, as this is the only place where both
locks were acquired simultaneously. Indeed, any other flow acquiring locks
in that order would be wrong regardless.

Signed-off-by: Jacob Keller <jacob.e.keller@intel.com>
Fixes: 1a931c4f5e ("igc: add lock preventing multiple simultaneous PTM transactions")
Link: https://lore.kernel.org/intel-wired-lan/Z_-P-Hc1yxcw0lTB@stanley.mountain/
Link: https://lore.kernel.org/intel-wired-lan/202504211511.f7738f5d-lkp@intel.com/T/#u
Reviewed-by: Przemek Kitszel <przemyslaw.kitszel@intel.com>
Reviewed-by: Vitaly Lifshits <vitaly.lifshits@intel.com>
Tested-by: Mor Bar-Gabay <morx.bar.gabay@intel.com>
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
2025-04-29 14:28:40 -07:00
Larysa Zaremba ed375b1821 idpf: protect shutdown from reset
Before the referenced commit, the shutdown just called idpf_remove(),
this way IDPF_REMOVE_IN_PROG was protecting us from the serv_task
rescheduling reset. Without this flag set the shutdown process is
vulnerable to HW reset or any other triggering conditions (such as
default mailbox being destroyed).

When one of conditions checked in idpf_service_task becomes true,
vc_event_task can be rescheduled during shutdown, this leads to accessing
freed memory e.g. idpf_req_rel_vector_indexes() trying to read
vport->q_vector_idxs. This in turn causes the system to become defunct
during e.g. systemctl kexec.

Considering using IDPF_REMOVE_IN_PROG would lead to more heavy shutdown
process, instead just cancel the serv_task before cancelling
adapter->serv_task before cancelling adapter->vc_event_task to ensure that
reset will not be scheduled while we are doing a shutdown.

Fixes: 4c9106f490 ("idpf: fix adapter NULL pointer dereference on reboot")
Reviewed-by: Michal Swiatkowski <michal.swiatkowski@linux.intel.com>
Signed-off-by: Larysa Zaremba <larysa.zaremba@intel.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Reviewed-by: Emil Tantilov <emil.s.tantilov@intel.com>
Tested-by: Samuel Salin <Samuel.salin@intel.com>
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
2025-04-29 14:28:40 -07:00
Michal Swiatkowski 8a558cbda5 idpf: fix potential memory leak on kcalloc() failure
In case of failing on rss_data->rss_key allocation the function is
freeing vport without freeing earlier allocated q_vector_idxs. Fix it.

Move from freeing in error branch to goto scheme.

Fixes: d4d5587182 ("idpf: initialize interrupts and enable vport")
Reviewed-by: Pavan Kumar Linga <pavan.kumar.linga@intel.com>
Reviewed-by: Aleksandr Loktionov <aleksandr.loktionov@intel.com>
Suggested-by: Pavan Kumar Linga <pavan.kumar.linga@intel.com>
Signed-off-by: Michal Swiatkowski <michal.swiatkowski@linux.intel.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Tested-by: Samuel Salin <Samuel.salin@intel.com>
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
2025-04-29 14:28:24 -07:00
Madhu Chittim 713dd6c2de idpf: fix offloads support for encapsulated packets
Split offloads into csum, tso and other offloads so that tunneled
packets do not by default have all the offloads enabled.

Stateless offloads for encapsulated packets are not yet supported in
firmware/software but in the driver we were setting the features same as
non encapsulated features.

Fixed naming to clarify CSUM bits are being checked for Tx.

Inherit netdev features to VLAN interfaces as well.

Fixes: 0fe45467a1 ("idpf: add create vport and netdev configuration")
Reviewed-by: Sridhar Samudrala <sridhar.samudrala@intel.com>
Signed-off-by: Madhu Chittim <madhu.chittim@intel.com>
Tested-by: Zachary Goldstein <zachmgoldstein@google.com>
Tested-by: Samuel Salin <Samuel.salin@intel.com>
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
Reviewed-by: Willem de Bruijn <willemb@google.com>
Link: https://patch.msgid.link/20250425222636.3188441-4-anthony.l.nguyen@intel.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-04-28 15:59:13 -07:00
Xuanqiang Luo 425c5f266b ice: Check VF VSI Pointer Value in ice_vc_add_fdir_fltr()
As mentioned in the commit baeb705fd6 ("ice: always check VF VSI
pointer values"), we need to perform a null pointer check on the return
value of ice_get_vf_vsi() before using it.

Fixes: 6ebbe97a48 ("ice: Add a per-VF limit on number of FDIR filters")
Signed-off-by: Xuanqiang Luo <luoxuanqiang@kylinos.cn>
Reviewed-by: Przemek Kitszel <przemyslaw.kitszel@intel.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
Link: https://patch.msgid.link/20250425222636.3188441-3-anthony.l.nguyen@intel.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-04-28 15:59:13 -07:00
Paul Greenwalt 3ffcd7b657 ice: fix Get Tx Topology AQ command error on E830
The Get Tx Topology AQ command (opcode 0x0418) has different read flag
requirements depending on the hardware/firmware. For E810, E822, and E823
firmware the read flag must be set, and for newer hardware (E825 and E830)
it must not be set.

This results in failure to configure Tx topology and the following warning
message during probe:

  DDP package does not support Tx scheduling layers switching feature -
  please update to the latest DDP package and try again

The current implementation only handles E825-C but not E830. It is
confusing as we first check ice_is_e825c() and then set the flag in the set
case. Finally, we check ice_is_e825c() again and set the flag for all other
hardware in both the set and get case.

Instead, notice that we always need the read flag for set, but only need
the read flag for get on E810, E822, and E823 firmware. Fix the logic to
check the MAC type and set the read flag in get only on the older devices
which require it.

Fixes: ba1124f58a ("ice: Add E830 device IDs, MAC type and registers")
Signed-off-by: Paul Greenwalt <paul.greenwalt@intel.com>
Signed-off-by: Jacob Keller <jacob.e.keller@intel.com>
Reviewed-by: Michal Swiatkowski <michal.swiatkowski@linux.intel.com>
Tested-by: Krishneil Singh <krishneil.k.singh@intel.com>
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
Link: https://patch.msgid.link/20250425222636.3188441-2-anthony.l.nguyen@intel.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-04-28 15:59:13 -07:00
Faizal Rahim f05ce73cc3 igc: add support to get frame preemption statistics via ethtool
Implemented "ethtool --include-statistics --show-mm" callback for IGC.

Tested preemption scenario to check preemption statistics:
1) Trigger verification handshake on both boards:
    $ sudo ethtool --set-mm enp1s0 pmac-enabled on
    $ sudo ethtool --set-mm enp1s0 tx-enabled on
    $ sudo ethtool --set-mm enp1s0 verify-enabled on
2) Set preemptible or express queue in taprio for tx board:
    $ sudo tc qdisc replace dev enp1s0 parent root handle 100 taprio \
      num_tc 4 map 3 2 1 0 3 3 3 3 3 3 3 3 3 3 3 3 \
      queues 1@0 1@1 1@2 1@3 base-time 0 sched-entry S F 100000 \
      fp E E P P
3) Send large size packets on preemptible queue
4) Send small size packets on express queue to preempt packets in
   preemptible queue
5) Show preemption statistics on the receiving board:
   $ ethtool --include-statistics --show-mm enp1s0
     MAC Merge layer state for enp1s0:
     pMAC enabled: on
     TX enabled: on
     TX active: on
     TX minimum fragment size: 64
     RX minimum fragment size: 60
     Verify enabled: on
     Verify time: 128
     Max verify time: 128
     Verification status: SUCCEEDED
     Statistics:
      MACMergeFrameAssErrorCount: 0
      MACMergeFrameSmdErrorCount: 0
      MACMergeFrameAssOkCount: 511
      MACMergeFragCountRx: 764
      MACMergeFragCountTx: 0
      MACMergeHoldCount: 0

Co-developed-by: Vinicius Costa Gomes <vinicius.gomes@intel.com>
Signed-off-by: Vinicius Costa Gomes <vinicius.gomes@intel.com>
Co-developed-by: Chwee-Lin Choong <chwee.lin.choong@intel.com>
Signed-off-by: Chwee-Lin Choong <chwee.lin.choong@intel.com>
Reviewed-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Signed-off-by: Faizal Rahim <faizal.abdul.rahim@linux.intel.com>
Tested-by: Mor Bar-Gabay <morx.bar.gabay@intel.com>
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
2025-04-18 09:17:14 -07:00
Faizal Rahim 10e2ffe10e igc: add support to get MAC Merge data via ethtool
Implement "ethtool --show-mm" callback for IGC.

Tested with command:
$ ethtool --show-mm enp1s0.
  MAC Merge layer state for enp1s0:
  pMAC enabled: on
  TX enabled: on
  TX active: on
  TX minimum fragment size: 64
  RX minimum fragment size: 60
  Verify enabled: on
  Verify time: 128
  Max verify time: 128
  Verification status: SUCCEEDED

Verified that the fields value are retrieved correctly.

Reviewed-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Signed-off-by: Faizal Rahim <faizal.abdul.rahim@linux.intel.com>
Tested-by: Mor Bar-Gabay <morx.bar.gabay@intel.com>
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
2025-04-18 09:17:14 -07:00
Faizal Rahim e9074d7f37 igc: block setting preemptible traffic class in taprio
Since preemptible tc implementation is not ready yet, block it from being
set in taprio. The existing code already blocks it in mqprio.

Reviewed-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Signed-off-by: Faizal Rahim <faizal.abdul.rahim@linux.intel.com>
Tested-by: Mor Bar-Gabay <morx.bar.gabay@intel.com>
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
2025-04-18 09:17:14 -07:00
Faizal Rahim 55ececab98 igc: add support to set tx-min-frag-size
Add support for setting tx-min-frag-size via the set_mm callback in igc.
If the requested value is unsupported, round it up to the smallest
supported i226 size (64, 128, 192, 256) and send a netlink message to
inform the user.

Co-developed-by: Vinicius Costa Gomes <vinicius.gomes@intel.com>
Signed-off-by: Vinicius Costa Gomes <vinicius.gomes@intel.com>
Reviewed-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Signed-off-by: Faizal Rahim <faizal.abdul.rahim@linux.intel.com>
Tested-by: Mor Bar-Gabay <morx.bar.gabay@intel.com>
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
2025-04-18 09:17:14 -07:00
Faizal Rahim 5422570c00 igc: add support for frame preemption verification
This patch implements the "ethtool --set-mm" callback to trigger the
frame preemption verification handshake.

Uses the MAC Merge Software Verification (mmsv) mechanism in ethtool
to perform the verification handshake for igc.
The structure fpe.mmsv is set by mmsv in ethtool and should remain
read-only for the driver.

Other mmsv callbacks:
a) configure_tx() -> not used yet at this point
   - igc lacks registers to configure FPE in the transmit direction, so
     this API is not utilized for now. When igc supports preemptible queue,
     driver will use this API to manage its configuration.

b) configure_pmac() -> not used
   - this callback dynamically controls pmac_enabled at runtime. For
     example, mmsv calls configure_pmac() and disables pmac_enabled when
     the link partner goes down, even if the user previously enabled it.
     The intention is to save power but it is not feasible in igc
     because it causes an endless adapter reset loop:

   1) Board A and Board B complete the verification handshake. Tx mode
      register for both boards are in TSN mode.
   2) Board B link goes down.

   On Board A:
   3) mmsv calls configure_pmac() with pmac_enabled = false.
   4) configure_pmac() in igc updates a new field based on pmac_enabled.
      Driver uses this field in igc_tsn_new_flags() to indicate that the
      user enabled/disabled FPE.
   5) configure_pmac() in igc calls igc_tsn_offload_apply() to check
      whether an adapter reset is needed. Calls existing logic in
      igc_tsn_will_tx_mode_change() and igc_tsn_new_flags().
   6) Since pmac_enabled is now disabled and no other TSN feature is
      active, igc_tsn_will_tx_mode_change() evaluates to true because Tx
      mode will switch from TSN to Legacy.
   7) Driver resets the adapter.
   8) Registers are set, and Tx mode switches to Legacy.
   9) When link partner is up, steps 3-8 repeat, but this time with
      pmac_enabled = true, reactivating TSN.
      igc_tsn_will_tx_mode_change() evaluates to true again, since Tx
      mode will switch from Legacy to TSN.
  10) Driver resets the adapter.
  11) Adapter reset completes, registers are set, and Tx mode switches to
      TSN.

  On Board B:
  12) Adapter reset on Board A at step 10 causes it to detect its link
      partner as down.
  13) Repeats steps 3-8.
  14) Once reset adapter on Board A is completed at step 11, it detects
      its link partner as up.
  15) Repeats steps 9-11.

   - this cycle repeats indefinitely. To avoid this issue, igc only uses
     mmsv.pmac_enabled to track whether FPE is enabled or disabled.

Co-developed-by: Vinicius Costa Gomes <vinicius.gomes@intel.com>
Signed-off-by: Vinicius Costa Gomes <vinicius.gomes@intel.com>
Co-developed-by: Choong Yong Liang <yong.liang.choong@linux.intel.com>
Signed-off-by: Choong Yong Liang <yong.liang.choong@linux.intel.com>
Co-developed-by: Chwee-Lin Choong <chwee.lin.choong@intel.com>
Signed-off-by: Chwee-Lin Choong <chwee.lin.choong@intel.com>
Reviewed-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Signed-off-by: Faizal Rahim <faizal.abdul.rahim@linux.intel.com>
Tested-by: Mor Bar-Gabay <morx.bar.gabay@intel.com>
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
2025-04-18 09:16:58 -07:00
Faizal Rahim 7663370e32 igc: set the RX packet buffer size for TSN mode
In preparation for supporting frame preemption, when entering TSN mode,
set the receive packet buffer to 15KB for the Express MAC, 15KB for
the Preemptible MAC and 2KB for the BMC.

References:
I225/I226 SW User Manual, Section 4.7.9, Section 7.1.3.2, Section 8.3.1

The newly introduced macros follow the naming from the i226 SW User Manual
for easy reference.

Co-developed-by: Vinicius Costa Gomes <vinicius.gomes@intel.com>
Signed-off-by: Vinicius Costa Gomes <vinicius.gomes@intel.com>
Signed-off-by: Faizal Rahim <faizal.abdul.rahim@linux.intel.com>
Tested-by: Mor Bar-Gabay <morx.bar.gabay@intel.com>
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
2025-04-18 08:43:10 -07:00
Faizal Rahim 9cd87aafc7 igc: use FIELD_PREP and GENMASK for existing RX packet buffer size
Prepare for an upcoming patch that modifies the RX buffer size in TSN mode.
Refactor IGC_RXPBSIZE_EXP_BMC_DEFAULT and IGC_RXPBS_CFG_TS_EN using
FIELD_PREP and GENMASK to improve clarity and maintainability. Refactor
both macros for consistency, even though the upcoming patch only use
IGC_RXPBSIZE_EXP_BMC_DEFAULT.

The newly introduced macros follow the naming from the i226 SW User Manual
for easy reference.

I've tested IGC_RXPBSIZE_EXP_BMC_DEFAULT and IGC_RXPBS_CFG_TS_EN before
and after the refactoring, and their values remain unchanged.

Reviewed-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Signed-off-by: Faizal Rahim <faizal.abdul.rahim@linux.intel.com>
Tested-by: Mor Bar-Gabay <morx.bar.gabay@intel.com>
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
2025-04-18 08:43:10 -07:00
Faizal Rahim 0d58cdc902 igc: optimize TX packet buffer utilization for TSN mode
In preparation for upcoming frame preemption patches, optimize the TX
packet buffer size. The total packet buffer size (RX + TX) is 64KB, with
a maximum of 34KB for either RX or TX. Split the buffer evenly,
allocating 32KB to each.

For TX, assign 7KB to each of the four TX packet buffers (total 28KB)
and reserve 4KB for BMC.

References:
I225/I226 SW User Manual Section 4.7.9, Section 8.3.2

Co-developed-by: Vinicius Costa Gomes <vinicius.gomes@intel.com>
Signed-off-by: Vinicius Costa Gomes <vinicius.gomes@intel.com>
Signed-off-by: Faizal Rahim <faizal.abdul.rahim@linux.intel.com>
Tested-by: Mor Bar-Gabay <morx.bar.gabay@intel.com>
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
2025-04-18 08:43:10 -07:00
Faizal Rahim 425d8d9cb0 igc: use FIELD_PREP and GENMASK for existing TX packet buffer size
In preparation for an upcoming patch that will modify the TX buffer size
in  TSN mode, replace IGC_TXPBSIZE_TSN and IGC_TXPBSIZE_DEFAULT
implementation with new macros that utilizes FIELD_PREP and GENMASK for
clarity.

The newly introduced macros follow the naming from the i226 SW User Manual
for easy reference.

I've tested IGC_TXPBSIZE_TSN and IGC_TXPBSIZE_DEFAULT before and after the
refactoring, and their values remain unchanged.

Reviewed-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Signed-off-by: Faizal Rahim <faizal.abdul.rahim@linux.intel.com>
Tested-by: Mor Bar-Gabay <morx.bar.gabay@intel.com>
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
2025-04-18 08:43:10 -07:00
Faizal Rahim 67287d67be igc: rename I225_RXPBSIZE_DEFAULT and I225_TXPBSIZE_DEFAULT
Rename RX and TX packet buffer size macros in preparation for an
upcoming patch that will refactor buffer size handling using FIELD_PREP
and GENMASK.

Changes:
- Rename I225_RXPBSIZE_DEFAULT to IGC_RXPBSIZE_EXP_BMC_DEFAULT.
  The EXP_BMC suffix explicitly indicates Express and BMC buffer
  default values, improving readability and reusability for the
  upcoming changes, while also better reflecting the current buffer
  allocations.
- Rename I225_TXPBSIZE_DEFAULT to IGC_TXPBSIZE_DEFAULT.

These registers apply to both i225 and i226, so using the IGC prefix
aligns with existing macro naming conventions.

Signed-off-by: Faizal Rahim <faizal.abdul.rahim@linux.intel.com>
Tested-by: Mor Bar-Gabay <morx.bar.gabay@intel.com>
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
2025-04-18 08:43:10 -07:00
Faizal Rahim 19d629079c igc: rename xdp_get_tx_ring() for non-xdp usage
Renamed xdp_get_tx_ring() function to a more generic name for use in
upcoming frame preemption patches.

Signed-off-by: Faizal Rahim <faizal.abdul.rahim@linux.intel.com>
Tested-by: Mor Bar-Gabay <morx.bar.gabay@intel.com>
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
2025-04-18 08:43:10 -07:00
Jakub Kicinski 240ce924d2 Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Cross-merge networking fixes after downstream PR (net-6.15-rc3).

No conflicts. Adjacent changes:

tools/net/ynl/pyynl/ynl_gen_c.py
  4d07bbf2d4 ("tools: ynl-gen: don't declare loop iterator in place")
  7e8ba0c7de ("tools: ynl: don't use genlmsghdr in classic netlink")

Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-04-17 12:26:50 -07:00
Jakub Kicinski 4e34a84061 Merge branch '10GbE' of git://git.kernel.org/pub/scm/linux/kernel/git/tnguy/next-queue
Tony Nguyen says:

====================
ixgbe: Add basic devlink support

Jedrzej Jagielski says:

Create devlink specific directory for more convenient future feature
development.

Flashing and reloading are supported only by E610 devices.

Introduce basic FW/NVM validation since devlink reload introduces
possibility of runtime NVM update. Check FW API version, FW recovery
mode and FW rollback mode. Introduce minimal recovery probe to let
user to reload the faulty FW when recovery mode is detected.

* '10GbE' of git://git.kernel.org/pub/scm/linux/kernel/git/tnguy/next-queue:
  ixgbe: add support for FW rollback mode
  ixgbe: add E610 implementation of FW recovery mode
  ixgbe: add FW API version check
  ixgbe: add support for devlink reload
  ixgbe: add device flash update via devlink
  ixgbe: extend .info_get() with stored versions
  ixgbe: add E610 functions getting PBA and FW ver info
  ixgbe: add .info_get extension specific for E610 devices
  ixgbe: read the netlist version information
  ixgbe: read the OROM version information
  ixgbe: add E610 functions for acquiring flash data
  ixgbe: add handler for devlink .info_get()
  ixgbe: add initial devlink support
  ixgbe: wrap netdev_priv() usage
  devlink: add value check to devlink_info_version_put()
====================

Link: https://patch.msgid.link/20250415221301.1633933-1-anthony.l.nguyen@intel.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-04-16 19:05:21 -07:00
Cosmin Ratiu 43eca05b6a xfrm: Add explicit dev to .xdo_dev_state_{add,delete,free}
Previously, device driver IPSec offload implementations would fall into
two categories:
1. Those that used xso.dev to determine the offload device.
2. Those that used xso.real_dev to determine the offload device.

The first category didn't work with bonding while the second did.
In a non-bonding setup the two pointers are the same.

This commit adds explicit pointers for the offload netdevice to
.xdo_dev_state_add() / .xdo_dev_state_delete() / .xdo_dev_state_free()
which eliminates the confusion and allows drivers from the first
category to work with bonding.

xso.real_dev now becomes a private pointer managed by the bonding
driver.

Signed-off-by: Cosmin Ratiu <cratiu@nvidia.com>
Reviewed-by: Leon Romanovsky <leonro@nvidia.com>
Reviewed-by: Nikolay Aleksandrov <razor@blackwall.org>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
2025-04-16 11:01:41 +02:00
Jacob Keller d9f3e9ecc4 net: ptp: introduce .supported_perout_flags to ptp_clock_info
The PTP_PEROUT_REQUEST2 ioctl has gained support for flags specifying
specific output behavior including PTP_PEROUT_ONE_SHOT,
PTP_PEROUT_DUTY_CYCLE, PTP_PEROUT_PHASE.

Driver authors are notorious for not checking the flags of the request.
This results in misinterpreting the request, generating an output signal
that does not match the requested value. It is anticipated that even more
flags will be added in the future, resulting in even more broken requests.

Expecting these issues to be caught during review or playing whack-a-mole
after the fact is not a great solution.

Instead, introduce the supported_perout_flags field in the ptp_clock_info
structure. Update the core character device logic to explicitly reject any
request which has a flag not on this list.

This ensures that drivers must 'opt in' to the flags they support. Drivers
which don't set the .supported_perout_flags field will not need to check
that unsupported flags aren't passed, as the core takes care of this.

Update the drivers which do support flags to set this new field.

Note the following driver files set n_per_out to a non-zero value but did
not check the flags at all:

 • drivers/ptp/ptp_clockmatrix.c
 • drivers/ptp/ptp_idt82p33.c
 • drivers/ptp/ptp_fc3.c
 • drivers/net/ethernet/ti/am65-cpts.c
 • drivers/net/ethernet/aquantia/atlantic/aq_ptp.c
 • drivers/net/ethernet/broadcom/bnxt/bnxt_ptp.c
 • drivers/net/dsa/sja1105/sja1105_ptp.c
 • drivers/net/ethernet/freescale/dpaa2/dpaa2-ptp.c
 • drivers/net/ethernet/mscc/ocelot_vsc7514.c
 • drivers/net/ethernet/intel/i40e/i40e_ptp.c

Reviewed-by: Vadim Fedorenko <vadim.fedorenko@linux.dev>
Signed-off-by: Jacob Keller <jacob.e.keller@intel.com>
Reviewed-by: Kory Maincent <kory.maincent@bootlin.com>
Link: https://patch.msgid.link/20250414-jk-supported-perout-flags-v2-2-f6b17d15475c@intel.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-04-15 20:20:58 -07:00
Jacob Keller 7c571ac57d net: ptp: introduce .supported_extts_flags to ptp_clock_info
The PTP_EXTTS_REQUEST(2) ioctl has a flags field which specifies how the
external timestamp request should behave. This includes which edge of the
signal to timestamp, as well as a specialized "offset" mode. It is expected
that more flags will be added in the future.

Driver authors routinely do not check the flags, often accepting requests
with flags which they do not support. Even drivers which do check flags may
not be future-proofed to reject flags not yet defined. Thus, any future
flag additions often require manually updating drivers to reject these
flags.

This approach of hoping we catch flag checks during review, or playing
whack-a-mole after the fact is the wrong approach.

Introduce the "supported_extts_flags" field to the ptp_clock_info
structure. This field defines the set of flags the device actually
supports.

Update the core character device logic to check this field and reject
unsupported requests. Getting this right is somewhat tricky. First, to
avoid unnecessary repetition and make basic functionality work when
.supported_extts_flags is 0, the core always accepts the PTP_ENABLE_FEATURE
flag. This flag is used to set the 'on' parameter to the .enable function
and is thus always 'supported' by all drivers.

For backwards compatibility, the PTP_RISING_EDGE and PTP_FALLING_EDGE flags
are merely "hints" when using the old PTP_EXTTS_REQUEST ioctl, and are not
expected to be enforced. If the user issues PTP_EXTTS_REQUEST2, the
PTP_STRICT_FLAGS flag is added which is supposed to inform the driver to
strictly validate the flags and reject unsupported requests. To handle
this, first check if the driver reports PTP_STRICT_FLAGS support. If it
does not, then always allow the PTP_RISING_EDGE and PTP_FALLING_EDGE flags.
This keeps backwards compatibility with the original PTP_EXTTS_REQUEST
ioctl where these flags are not guaranteed to be honored.

This way, drivers which do not set the supported_extts_flags will continue
to accept requests for the original PTP_EXTTS_REQUEST ioctl. The core will
automatically reject requests with new flags, and correctly reject requests
with PTP_STRICT_FLAGS, where the driver is supposed to strictly validate
the flags.

Update the various drivers, refactoring their validation logic into the
.supported_extts_flags field. For consistency and readability,
PTP_ENABLE_FEATURE is not set in the supported flags list, and
PTP_EXTTS_EDGES is expanded to PTP_RISING_EDGE | PTP_FALLING_EDGE in all
cases.

Note the following driver files set n_ext_ts to a non-zero value but did
not check flags at all:

 • drivers/net/ethernet/freescale/dpaa2/dpaa2-ptp.c
 • drivers/net/ethernet/freescale/enetc/enetc_ptp.c
 • drivers/net/ethernet/intel/i40e/i40e_ptp.c
 • drivers/net/ethernet/marvell/octeontx2/nic/otx2_ptp.c
 • drivers/net/ethernet/renesas/ravb_ptp.c
 • drivers/net/ethernet/renesas/rtsn.c
 • drivers/net/ethernet/renesas/rtsn.h
 • drivers/net/ethernet/ti/am65-cpts.c
 • drivers/net/ethernet/ti/cpts.h
 • drivers/net/ethernet/ti/icssg/icss_iep.c
 • drivers/net/ethernet/xscale/ptp_ixp46x.c
 • drivers/net/phy/bcm-phy-ptp.c
 • drivers/ptp/ptp_ocp.c
 • drivers/ptp/ptp_pch.c
 • drivers/ptp/ptp_qoriq.c

These drivers behavior does change slightly: they will now reject the
PTP_EXTTS_REQUEST2 ioctl, because they do not strictly validate their
flags. This also makes them no longer incorrectly accept PTP_EXT_OFFSET.

Also note that the renesas ravb driver does not support PTP_STRICT_FLAGS.
We could leave the .supported_extts_flags as 0, but I added the
PTP_RISING_EDGE | PTP_FALLING_EDGE since the driver previously manually
validated these flags. This is equivalent to 0 because the core will allow
these flags regardless unless PTP_STRICT_FLAGS is also set.

Signed-off-by: Jacob Keller <jacob.e.keller@intel.com>
Reviewed-by: Kory Maincent <kory.maincent@bootlin.com>
Link: https://patch.msgid.link/20250414-jk-supported-perout-flags-v2-1-f6b17d15475c@intel.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-04-15 20:20:57 -07:00
Andrii Staikov 4811b0c220 ixgbe: add support for FW rollback mode
The driver should detect whether the device entered FW rollback
mode and then notify user with the dedicated message including
FW and NVM versions.

Even if the driver detected rollback mode, this should not result
in an probe error and the normal flow proceeds.

FW tries to rollback to "old" operational FW located in the
inactive NVM bank in cases when newly loaded FW exhibits faulty
behavior. If something goes wrong during boot the FW may switch
into rollback mode in an attempt to avoid recovery mode and stay
operational. After rollback is successful, the banks are swapped,
and the "rollback" bank becomes the active bank for the next reset.

Reviewed-by: Mateusz Polchlopek <mateusz.polchlopek@intel.com>
Signed-off-by: Andrii Staikov <andrii.staikov@intel.com>
Signed-off-by: Jedrzej Jagielski <jedrzej.jagielski@intel.com>
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
2025-04-15 07:36:33 -07:00
Jedrzej Jagielski 29cb3b8d95 ixgbe: add E610 implementation of FW recovery mode
Add E610 implementation of fw_recovery_mode MAC operation.

In case of E610 information about recovery mode is obtained
from FW_MODES field in IXGBE_GL_MNG_FWSM register (0x000B6134).

Introduce recovery specific probing flow and init only
vital features.

User should be able to perform NVM update using devlink
once FW error is detected in order to load a healthy img.

Reviewed-by: Mateusz Polchlopek <mateusz.polchlopek@intel.com>
Co-developed-by: Stefan Wegrzyn <stefan.wegrzyn@intel.com>
Signed-off-by: Stefan Wegrzyn <stefan.wegrzyn@intel.com>
Signed-off-by: Jedrzej Jagielski <jedrzej.jagielski@intel.com>
Tested-by: Bharath R <bharath.r@intel.com>
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
2025-04-15 07:36:33 -07:00
Jedrzej Jagielski b5aae90b6b ixgbe: add FW API version check
Add E610 specific function checking whether the FW API version
is compatible with the driver expectations.

The major API version should be less than or equal to the expected
API version. If not the driver won't be fully operational.

Check the minor version, and if it is more than two versions lesser
or greater than the expected version, print a message indicating
that the NVM or driver should be updated respectively.

Reviewed-by: Mateusz Polchlopek <mateusz.polchlopek@intel.com>
Co-developed-by: Piotr Kwapulinski <piotr.kwapulinski@intel.com>
Signed-off-by: Piotr Kwapulinski <piotr.kwapulinski@intel.com>
Signed-off-by: Jedrzej Jagielski <jedrzej.jagielski@intel.com>
Tested-by: Bharath R <bharath.r@intel.com>
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
2025-04-15 07:36:33 -07:00
Jedrzej Jagielski c9e563cae1 ixgbe: add support for devlink reload
The E610 adapters contain an embedded chip with firmware which can be
updated using devlink flash. The firmware which runs on this chip is
referred to as the Embedded Management Processor firmware (EMP
firmware).

Activating the new firmware image currently requires that the system be
rebooted. This is not ideal as rebooting the system can cause unwanted
downtime.

The EMP firmware itself can be reloaded by issuing a special update
to the device called an Embedded Management Processor reset (EMP
reset). This reset causes the device to reset and reload the EMP
firmware.

Implement support for devlink reload with the "fw_activate" flag. This
allows user space to request the firmware be activated immediately.

Reviewed-by: Mateusz Polchlopek <mateusz.polchlopek@intel.com>
Tested-by: Bharath R <bharath.r@intel.com>
Co-developed-by: Slawomir Mrozowicz <slawomirx.mrozowicz@intel.com>
Signed-off-by: Slawomir Mrozowicz <slawomirx.mrozowicz@intel.com>
Co-developed-by: Piotr Kwapulinski <piotr.kwapulinski@intel.com>
Signed-off-by: Piotr Kwapulinski <piotr.kwapulinski@intel.com>
Co-developed-by: Stefan Wegrzyn <stefan.wegrzyn@intel.com>
Signed-off-by: Stefan Wegrzyn <stefan.wegrzyn@intel.com>
Signed-off-by: Jedrzej Jagielski <jedrzej.jagielski@intel.com>
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
2025-04-15 07:36:33 -07:00
Jedrzej Jagielski a0f45672d5 ixgbe: add device flash update via devlink
Use the pldmfw library to implement device flash update for
the Intel ixgbe networking device driver specifically for E610 devices.
This support uses the devlink flash update interface.

Using the pldmfw library, the provided firmware file will be scanned for
the three major components, "fw.undi" for the Option ROM, "fw.mgmt" for
the main NVM module containing the primary device firmware, and
"fw.netlist" containing the netlist module.

The flash is separated into two banks, the active bank containing the
running firmware, and the inactive bank which we use for update. Each
module is updated in a staged process. First, the inactive bank is
erased, preparing the device for update. Second, the contents of the
component are copied to the inactive portion of the flash. After all
components are updated, the driver signals the device to switch the
active bank during the next EMP reset.

With this implementation, basic flash update for the E610 hardware is
supported.

Reviewed-by: Jacob Keller <jacob.e.keller@intel.com>
Tested-by: Bharath R <bharath.r@intel.com>
Co-developed-by: Slawomir Mrozowicz <slawomirx.mrozowicz@intel.com>
Signed-off-by: Slawomir Mrozowicz <slawomirx.mrozowicz@intel.com>
Co-developed-by: Piotr Kwapulinski <piotr.kwapulinski@intel.com>
Signed-off-by: Piotr Kwapulinski <piotr.kwapulinski@intel.com>
Co-developed-by: Stefan Wegrzyn <stefan.wegrzyn@intel.com>
Signed-off-by: Stefan Wegrzyn <stefan.wegrzyn@intel.com>
Signed-off-by: Jedrzej Jagielski <jedrzej.jagielski@intel.com>
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
2025-04-15 07:36:32 -07:00
Jedrzej Jagielski 6eae2aeb60 ixgbe: extend .info_get() with stored versions
Add functions reading inactive versions from the inactive flash
bank.

Print stored versions for the content present in the inactive bank.
If there's pending update the versions reflect the ones which
are going to be loaded after reload. If there's no pending update
both running and stored are the same, which means there won't
be any NVM change on reload.

Co-developed-by: Slawomir Mrozowicz <slawomirx.mrozowicz@intel.com>
Signed-off-by: Slawomir Mrozowicz <slawomirx.mrozowicz@intel.com>
Co-developed-by: Piotr Kwapulinski <piotr.kwapulinski@intel.com>
Signed-off-by: Piotr Kwapulinski <piotr.kwapulinski@intel.com>
Signed-off-by: Jedrzej Jagielski <jedrzej.jagielski@intel.com>
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
2025-04-15 07:36:32 -07:00
Jedrzej Jagielski 4654ec6194 ixgbe: add E610 functions getting PBA and FW ver info
Introduce 2 E610 specific callbacks implementations:
-ixgbe_start_hw_e610() which expands the regular .start_hw callback with
getting FW version information
-ixgbe_read_pba_string_e610() which gets Product Board Assembly string

Extend EEPROM ops with new .read_pba_string in order to distinguish
generic one and the E610 one.

Reviewed-by: Mateusz Polchlopek <mateusz.polchlopek@intel.com>
Tested-by: Bharath R <bharath.r@intel.com>
Co-developed-by: Stefan Wegrzyn <stefan.wegrzyn@intel.com>
Signed-off-by: Stefan Wegrzyn <stefan.wegrzyn@intel.com>
Signed-off-by: Jedrzej Jagielski <jedrzej.jagielski@intel.com>
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
2025-04-15 07:36:32 -07:00
Jedrzej Jagielski 8210ff7380 ixgbe: add .info_get extension specific for E610 devices
E610 devices give possibility to show more detailed info than the previous
boards.
Extend reporting NVM info with following pieces:
 fw.mgmt.api -> version number of the API
 fw.mgmt.build -> identifier of the source for the FW
 fw.mgmt.srev -> number defining FW's security revision
 fw.psid.api -> version defining the format of the flash contents
 fw.undi.srev -> number defining OROM's security revision
 fw.netlist -> version of the netlist module
 fw.netlist.build -> first 4 bytes of the netlist hash

Co-developed-by: Slawomir Mrozowicz <slawomirx.mrozowicz@intel.com>
Signed-off-by: Slawomir Mrozowicz <slawomirx.mrozowicz@intel.com>
Co-developed-by: Piotr Kwapulinski <piotr.kwapulinski@intel.com>
Signed-off-by: Piotr Kwapulinski <piotr.kwapulinski@intel.com>
Signed-off-by: Jedrzej Jagielski <jedrzej.jagielski@intel.com>
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
2025-04-15 07:36:32 -07:00
Slawomir Mrozowicz 904c2b4c0b ixgbe: read the netlist version information
Add functions reading the netlist version info and use them
as a part of the setting NVM info procedure.

Reviewed-by: Mateusz Polchlopek <mateusz.polchlopek@intel.com>
Tested-by: Bharath R <bharath.r@intel.com>
Signed-off-by: Slawomir Mrozowicz <slawomirx.mrozowicz@intel.com>
Co-developed-by: Piotr Kwapulinski <piotr.kwapulinski@intel.com>
Signed-off-by: Piotr Kwapulinski <piotr.kwapulinski@intel.com>
Signed-off-by: Jedrzej Jagielski <jedrzej.jagielski@intel.com>
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
2025-04-15 07:36:32 -07:00
Slawomir Mrozowicz 70db0788a2 ixgbe: read the OROM version information
Add functions reading the OROM version info and use them
as a part of the setting NVM info procedure.

Reviewed-by: Mateusz Polchlopek <mateusz.polchlopek@intel.com>
Reviewed-by: Przemek Kitszel <przemyslaw.kitszel@intel.com>
Tested-by: Bharath R <bharath.r@intel.com>
Signed-off-by: Slawomir Mrozowicz <slawomirx.mrozowicz@intel.com>
Co-developed-by: Piotr Kwapulinski <piotr.kwapulinski@intel.com>
Signed-off-by: Piotr Kwapulinski <piotr.kwapulinski@intel.com>
Signed-off-by: Jedrzej Jagielski <jedrzej.jagielski@intel.com>
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
2025-04-15 07:36:32 -07:00
Slawomir Mrozowicz 5f214150c7 ixgbe: add E610 functions for acquiring flash data
Read NVM related info from the flash.

Add several helper functions used to access the flash data,
find memory banks, calculate offsets, calculate the flash size.

Reviewed-by: Przemek Kitszel <przemyslaw.kitszel@intel.com>
Reviewed-by: Mateusz Polchlopek <mateusz.polchlopek@intel.com>
Tested-by: Bharath R <bharath.r@intel.com>
Signed-off-by: Slawomir Mrozowicz <slawomirx.mrozowicz@intel.com>
Co-developed-by: Piotr Kwapulinski <piotr.kwapulinski@intel.com>
Signed-off-by: Piotr Kwapulinski <piotr.kwapulinski@intel.com>
Co-developed-by: Jedrzej Jagielski <jedrzej.jagielski@intel.com>
Signed-off-by: Jedrzej Jagielski <jedrzej.jagielski@intel.com>
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
2025-04-15 07:36:32 -07:00
Jedrzej Jagielski f6b588af3d ixgbe: add handler for devlink .info_get()
Provide devlink .info_get() callback implementation to allow the
driver to report detailed version information. The following info
is reported:

 "serial_number" -> The PCI DSN of the adapter
 "fw.bundle_id" -> Unique identifier for the combined flash image
 "fw.undi" -> Version of the Option ROM containing the UEFI driver
 "board.id" -> The PBA ID string

Reviewed-by: Mateusz Polchlopek <mateusz.polchlopek@intel.com>
Tested-by: Bharath R <bharath.r@intel.com>
Signed-off-by: Jedrzej Jagielski <jedrzej.jagielski@intel.com>
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
2025-04-15 07:36:32 -07:00
Jedrzej Jagielski a0285236ab ixgbe: add initial devlink support
Add an initial support for devlink interface to ixgbe driver.

Similarly to i40e driver the implementation doesn't enable
devlink to manage device-wide configuration. Devlink instance
is created for each physical function of PCIe device.

Create separate directory for devlink related ixgbe files
and use naming scheme similar to the one used in the ice driver.

Add a stub for Documentation, to be extended by further patches.

Change struct ixgbe_adapter allocation to be done by devlink (Przemek),
as suggested by Jiri.

Reviewed-by: Mateusz Polchlopek <mateusz.polchlopek@intel.com>
Co-developed-by: Przemek Kitszel <przemyslaw.kitszel@intel.com>
Signed-off-by: Przemek Kitszel <przemyslaw.kitszel@intel.com>
Tested-by: Bharath R <bharath.r@intel.com>
Signed-off-by: Jedrzej Jagielski <jedrzej.jagielski@intel.com>
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
2025-04-15 07:36:32 -07:00
Przemek Kitszel fd5ef5203c ixgbe: wrap netdev_priv() usage
Wrap use of netdev_priv() in order to change the allocator of the device
private structure from alloc_etherdev_mq() to the devlink in next commit.

All but one netdev_priv() calls in the whole driver are replaced, the
remaining one is called on MACVLAN (so not ixgbe) device.

Signed-off-by: Przemek Kitszel <przemyslaw.kitszel@intel.com>
Tested-by: Bharath R <bharath.r@intel.com>
Signed-off-by: Jedrzej Jagielski <jedrzej.jagielski@intel.com>
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
2025-04-15 07:36:32 -07:00
Piotr Wejman 39aa687a84 net: e1000e: convert to ndo_hwtstamp_get() and ndo_hwtstamp_set()
Update the driver to use the new hardware timestamping API added in commit
66f7223039 ("net: add NDOs for configuring hardware timestamping").
Use Netlink extack for error reporting in e1000e_config_hwtstamp.
Align the indentation of net_device_ops.

Reviewed-by: Vadim Fedorenko <vadim.fedorenko@linux.dev>
Reviewed-by: Simon Horman <horms@kernel.org>
Reviewed-by: Vitaly Lifshits <vitaly.lifshits@intel.com>
Signed-off-by: Piotr Wejman <wejmanpm@gmail.com>
Tested-by: Avigail Dahan <avigailx.dahan@intel.com>
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
2025-04-11 11:58:58 -07:00
Rui Salvaterra f9c961efb0 igc: enable HW vlan tag insertion/stripping by default
This is enabled by default in other Intel drivers I've checked (e1000, e1000e,
iavf, igb and ice). Fixes an out-of-the-box performance issue when running
OpenWrt on typical mini-PCs with igc-supported Ethernet controllers and 802.1Q
VLAN configurations, as ethtool isn't part of the default packages and sane
defaults are expected.

In my specific case, with an Intel N100-based machine with four I226-V Ethernet
controllers, my upload performance increased from under 30 Mb/s to the expected
~1 Gb/s.

Signed-off-by: Rui Salvaterra <rsalvaterra@gmail.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Reviewed-by: Vitaly Lifshits <vitaly.lifshits@intel.com>
Reviewed-by: Kurt Kanzenbach <kurt@linutronix.de>
Tested-by: Mor Bar-Gabay <morx.bar.gabay@intel.com>
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
2025-04-11 11:58:58 -07:00
Rand Deeb cdcb3804ee ixgbe: Fix unreachable retry logic in combined and byte I2C write functions
The current implementation of `ixgbe_write_i2c_combined_generic_int` and
`ixgbe_write_i2c_byte_generic_int` sets `max_retry` to `1`, which makes
the condition `retry < max_retry` always evaluate to `false`. This renders
the retry mechanism ineffective, as the debug message and retry logic are
never executed.

This patch increases `max_retry` to `3` in both functions, aligning them
with the retry logic in `ixgbe_read_i2c_combined_generic_int`. This
ensures that the retry mechanism functions as intended, improving
robustness in case of I2C write failures.

Found by Linux Verification Center (linuxtesting.org) with SVACE.

Signed-off-by: Rand Deeb <rand.sec96@gmail.com>
Tested-by: Rinitha S <sx.rinitha@intel.com> (A Contingent worker at Intel)
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
2025-04-11 11:58:58 -07:00
Kyungwook Boo 015bac5dac i40e: fix MMIO write access to an invalid page in i40e_clear_hw
When the device sends a specific input, an integer underflow can occur, leading
to MMIO write access to an invalid page.

Prevent the integer underflow by changing the type of related variables.

Signed-off-by: Kyungwook Boo <bookyungwook@gmail.com>
Link: https://lore.kernel.org/lkml/ffc91764-1142-4ba2-91b6-8c773f6f7095@gmail.com/T/
Reviewed-by: Przemek Kitszel <przemyslaw.kitszel@intel.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Reviewed-by: Aleksandr Loktionov <aleksandr.loktionov@intel.com>
Tested-by: Rinitha S <sx.rinitha@intel.com> (A Contingent worker at Intel)
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
2025-04-11 11:58:57 -07:00
Colin Ian King fee4a79a12 ice: make const read-only array dflt_rules static
Don't populate the const read-only array dflt_rules on the stack at run
time, instead make it static.

Signed-off-by: Colin Ian King <colin.i.king@gmail.com>
Tested-by: Rinitha S <sx.rinitha@intel.com> (A Contingent worker at Intel)
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
2025-04-11 11:58:57 -07:00
Martyna Szapar-Mudlaw 6cb10c063d ice: improve error message for insufficient filter space
When adding a rule to switch through tc, if the operation fails
due to not enough free recipes (-ENOSPC), provide a clearer
error message: "Unable to add filter: insufficient space available."

This improves user feedback by distinguishing space limitations from
other generic failures.

Reviewed-by: Michal Swiatkowski <michal.swiatkowski@linux.intel.com>
Signed-off-by: Martyna Szapar-Mudlaw <martyna.szapar-mudlaw@linux.intel.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Tested-by: Sujai Buvaneswaran <sujai.buvaneswaran@intel.com>
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
2025-04-11 11:58:57 -07:00
Karol Kolacinski e2193f9f9e ice: enable timesync operation on 2xNAC E825 devices
According to the E825C specification, SBQ address for ports on a single
complex is device 2 for PHY 0 and device 13 for PHY1.
For accessing ports on a dual complex E825C (so called 2xNAC mode),
the driver should use destination device 2 (referred as phy_0) for
the current complex PHY and device 13 (referred as phy_0_peer) for
peer complex PHY.

Differentiate SBQ destination device by checking if current PF port
number is on the same PHY as target port number.

Adjust 'ice_get_lane_number' function to provide unique port number for
ports from PHY1 in 'dual' mode config (by adding fixed offset for PHY1
ports). Cache this value in ice_hw struct.

Introduce ice_get_primary_hw wrapper to get access to timesync register
not available from second NAC.

Reviewed-by: Simon Horman <horms@kernel.org>
Reviewed-by: Przemek Kitszel <przemyslaw.kitszel@intel.com>
Signed-off-by: Karol Kolacinski <karol.kolacinski@intel.com>
Co-developed-by: Grzegorz Nitka <grzegorz.nitka@intel.com>
Signed-off-by: Grzegorz Nitka <grzegorz.nitka@intel.com>
Tested-by: Rinitha S <sx.rinitha@intel.com> (A Contingent worker at Intel)
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
2025-04-11 10:46:37 -07:00
Karol Kolacinski 1fd9c91f7e ice: refactor ice_sbq_msg_dev enum
Rename ice_sbq_msg_dev to ice_sbq_dev_id to reflect the meaning of this
type more precisely. This enum type describes RDA (Remote Device Access)
client ids, accessible over SB (Side Band) interface.
Rename enum elements to make a driver namespace more cleaner and
consistent with other definitions within SB
Remove unused 'rmn_x' entries, specific to unsupported E824 device.
Adjust clients '2' and '13' names (phy_0 and phy_0_peer respectively) to
be compliant with EAS doc. According to the specification, regardless of
the complex entity (single or dual), when accessing its own ports,
they're accessed always as 'phy_0' client. And referred as 'phy_0_peer'
when handling ports connected to the other complex.

Reviewed-by: Simon Horman <horms@kernel.org>
Reviewed-by: Przemek Kitszel <przemyslaw.kitszel@intel.com>
Signed-off-by: Karol Kolacinski <karol.kolacinski@intel.com>
Co-developed-by: Grzegorz Nitka <grzegorz.nitka@intel.com>
Signed-off-by: Grzegorz Nitka <grzegorz.nitka@intel.com>
Tested-by: Rinitha S <sx.rinitha@intel.com> (A Contingent worker at Intel)
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
2025-04-11 10:46:37 -07:00
Karol Kolacinski 1e05c5a05d ice: remove SW side band access workaround for E825
Due to the bug in FW/NVM autoload mechanism (wrong default
SB_REM_DEV_CTL register settings), the access to peer PHY and CGU
clients was disabled by default.

As the workaround solution, the register value was overwritten by the
driver at the probe or reset handling.
Remove workaround as it's not needed anymore. The fix in autoload
procedure has been provided with NVM 3.80 version.

NOTE: at the time the fix was provided in NVM, the E825C product was
not officially available on the market, so it's not expected this change
will cause regression when running with older driver/kernel versions.

Reviewed-by: Michal Swiatkowski <michal.swiatkowski@linux.intel.com>
Reviewed-by: Przemek Kitszel <przemyslaw.kitszel@intel.com>
Signed-off-by: Karol Kolacinski <karol.kolacinski@intel.com>
Signed-off-by: Grzegorz Nitka <grzegorz.nitka@intel.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Tested-by: Rinitha S <sx.rinitha@intel.com> (A Contingent worker at Intel)
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
2025-04-11 10:46:37 -07:00
Larysa Zaremba 517f7a08ca ice: enable LLDP TX for VFs through tc
Only a single VSI can be in charge of sending LLDP frames, sometimes it is
beneficial to assign this function to a VF, that is possible to do with tc
capabilities in the switchdev mode. It requires first blocking the PF from
sending the LLDP frames with a following command:

tc filter add dev <ifname> egress protocol lldp flower skip_sw action drop

Then it becomes possible to configure a forward rule from a VF port
representor to uplink instead.

tc filter add dev <vf_ifname> ingress protocol lldp flower skip_sw
action mirred egress redirect dev <ifname>

How LLDP exclusivity was done previously is LLDP traffic was blocked for a
whole port by a single rule and PF was bypassing that. Now at least in the
switchdev mode, every separate VSI has to have its own drop rule. Another
complication is the fact that tc does not respect when the driver refuses
to delete a rule, so returning an error results in a HW rule still present
with no way to reference it through tc. This is addressed by allowing the
PF rule to be deleted at any time, but making the VF forward rule "dormant"
in such case, this means it is deleted from HW but stays in tc and driver's
bookkeeping to be restored when drop rule is added back to the PF.

Implement tc configuration handling which enables the user to transmit LLDP
packets from VF instead of PF.

Reviewed-by: Michal Swiatkowski <michal.swiatkowski@linux.intel.com>
Signed-off-by: Larysa Zaremba <larysa.zaremba@intel.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Tested-by: Rafal Romanowski <rafal.romanowski@intel.com>
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
2025-04-11 10:45:52 -07:00
Larysa Zaremba 40f42dc1cb ice: support egress drop rules on PF
tc clsact qdisc allows us to add offloaded egress rules with commands such
as the following one:

tc filter add dev <ifname> egress protocol lldp flower skip_sw action drop

Support the egress rule drop action when added to PF, with a few caveats:
* in switchdev mode, all PF traffic has to go uplink with an exception for
  LLDP that can be delegated to a single VSI at a time
* in legacy mode, we cannot delegate LLDP functionality to another VSI, so
  such packets from PF should not be blocked.

Also, simplify the rule direction logic, it was previously derived from
actions, but actually can be inherited from the tc block (and flipped in
case of port representors).

Reviewed-by: Michal Swiatkowski <michal.swiatkowski@linux.intel.com>
Signed-off-by: Larysa Zaremba <larysa.zaremba@intel.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Tested-by: Rafal Romanowski <rafal.romanowski@intel.com>
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
2025-04-11 09:47:43 -07:00
Larysa Zaremba 5787179c51 ice: remove headers argument from ice_tc_count_lkups
Remove the headers argument from the ice_tc_count_lkups() function, because
it is not used anywhere.

Reviewed-by: Michal Swiatkowski <michal.swiatkowski@linux.intel.com>
Signed-off-by: Larysa Zaremba <larysa.zaremba@intel.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Tested-by: Rafal Romanowski <rafal.romanowski@intel.com>
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
2025-04-11 09:46:37 -07:00
Mateusz Pacuszka 2296345416 ice: receive LLDP on trusted VFs
When a trusted VF tries to configure an LLDP multicast address, configure a
rule that would mirror the traffic to this VF, untrusted VFs are not
allowed to receive LLDP at all, so the request to add LLDP MAC address will
always fail for them.

Add a forwarding LLDP filter to a trusted VF when it tries to add an LLDP
multicast MAC address. The MAC address has to be added after enabling
trust (through restarting the LLDP service).

Signed-off-by: Mateusz Pacuszka <mateuszx.pacuszka@intel.com>
Co-developed-by: Larysa Zaremba <larysa.zaremba@intel.com>
Signed-off-by: Larysa Zaremba <larysa.zaremba@intel.com>
Reviewed-by: Michal Swiatkowski <michal.swiatkowski@linux.intel.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Tested-by: Rafal Romanowski <rafal.romanowski@intel.com>
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
2025-04-11 09:45:30 -07:00
Larysa Zaremba 4d5a1c4e6d ice: do not add LLDP-specific filter if not necessary
Commit 34295a3696 ("ice: implement new LLDP filter command")
introduced the ability to use LLDP-specific filter that directs all
LLDP traffic to a single VSI. However, current goal is for all trusted VFs
to be able to see LLDP neighbors, which is impossible to do with the
special filter.

Make using the generic filter the default choice and fall back to special
one only if a generic filter cannot be added. That way setups with "NVMs
where an already existent LLDP filter is blocking the creation of a filter
to allow LLDP packets" will still be able to configure software Rx LLDP on
PF only, while all other setups would be able to forward them to VFs too.

Reviewed-by: Michal Swiatkowski <michal.swiatkowski@linux.intel.com>
Signed-off-by: Larysa Zaremba <larysa.zaremba@intel.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Tested-by: Rafal Romanowski <rafal.romanowski@intel.com>
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
2025-04-11 09:44:26 -07:00
Mateusz Pacuszka a808691df3 ice: fix check for existing switch rule
In case the rule already exists and another VSI wants to subscribe to it
new VSI list is being created and both VSIs are moved to it.
Currently, the check for already existing VSI with the same rule is done
based on fdw_id.hw_vsi_id, which applies only to LOOKUP_RX flag.
Change it to vsi_handle. This is software VSI ID, but it can be applied
here, because vsi_map itself is also based on it.

Additionally change return status in case the VSI already exists in the
VSI map to "Already exists". Such case should be handled by the caller.

Signed-off-by: Mateusz Pacuszka <mateuszx.pacuszka@intel.com>
Reviewed-by: Przemek Kitszel <przemyslaw.kitszel@intel.com>
Reviewed-by: Michal Swiatkowski <michal.swiatkowski@linux.intel.com>
Signed-off-by: Larysa Zaremba <larysa.zaremba@intel.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Tested-by: Rafal Romanowski <rafal.romanowski@intel.com>
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
2025-04-11 09:32:31 -07:00
Christopher S M Hall 1a931c4f5e igc: add lock preventing multiple simultaneous PTM transactions
Add a mutex around the PTM transaction to prevent multiple transactors

Multiple processes try to initiate a PTM transaction, one or all may
fail. This can be reproduced by running two instances of the
following:

$ sudo phc2sys -O 0 -i tsn0 -m

PHC2SYS exits with:

"ioctl PTP_OFFSET_PRECISE: Connection timed out" when the PTM transaction
 fails

Note: Normally two instance of PHC2SYS will not run, but one process
 should not break another.

Fixes: a90ec84837 ("igc: Add support for PTP getcrosststamp()")
Signed-off-by: Christopher S M Hall <christopher.s.hall@intel.com>
Reviewed-by: Corinna Vinschen <vinschen@redhat.com>
Signed-off-by: Jacob Keller <jacob.e.keller@intel.com>
Tested-by: Mor Bar-Gabay <morx.bar.gabay@intel.com>
Acked-by: Vinicius Costa Gomes <vinicius.gomes@intel.com>
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
2025-04-11 09:13:28 -07:00
Christopher S M Hall 1f025759ba igc: cleanup PTP module if probe fails
Make sure that the PTP module is cleaned up if the igc_probe() fails by
calling igc_ptp_stop() on exit.

Fixes: d89f88419f ("igc: Add skeletal frame for Intel(R) 2.5G Ethernet Controller support")
Signed-off-by: Christopher S M Hall <christopher.s.hall@intel.com>
Reviewed-by: Corinna Vinschen <vinschen@redhat.com>
Signed-off-by: Jacob Keller <jacob.e.keller@intel.com>
Tested-by: Mor Bar-Gabay <morx.bar.gabay@intel.com>
Acked-by: Vinicius Costa Gomes <vinicius.gomes@intel.com>
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
2025-04-11 09:13:28 -07:00
Christopher S M Hall 26a3910afd igc: handle the IGC_PTP_ENABLED flag correctly
All functions in igc_ptp.c called from igc_main.c should check the
IGC_PTP_ENABLED flag. Adding check for this flag to stop and reset
functions.

Fixes: 5f2958052c ("igc: Add basic skeleton for PTP")
Signed-off-by: Christopher S M Hall <christopher.s.hall@intel.com>
Reviewed-by: Corinna Vinschen <vinschen@redhat.com>
Signed-off-by: Jacob Keller <jacob.e.keller@intel.com>
Tested-by: Mor Bar-Gabay <morx.bar.gabay@intel.com>
Acked-by: Vinicius Costa Gomes <vinicius.gomes@intel.com>
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
2025-04-11 09:13:28 -07:00
Christopher S M Hall cd7f7328d6 igc: move ktime snapshot into PTM retry loop
Move ktime_get_snapshot() into the loop. If a retry does occur, a more
recent snapshot will result in a more accurate cross-timestamp.

Fixes: a90ec84837 ("igc: Add support for PTP getcrosststamp()")
Reviewed-by: Michal Swiatkowski <michal.swiatkowski@linux.intel.com>
Tested-by: Mor Bar-Gabay <morx.bar.gabay@intel.com>
Tested-by: Avigail Dahan <avigailx.dahan@intel.com>
Signed-off-by: Christopher S M Hall <christopher.s.hall@intel.com>
Reviewed-by: Corinna Vinschen <vinschen@redhat.com>
Signed-off-by: Jacob Keller <jacob.e.keller@intel.com>
Acked-by: Vinicius Costa Gomes <vinicius.gomes@intel.com>
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
2025-04-11 09:13:28 -07:00
Christopher S M Hall 714cd033da igc: increase wait time before retrying PTM
The i225/i226 hardware retries if it receives an inappropriate response
from the upstream device. If the device retries too quickly, the root
port does not respond.

The wait between attempts was reduced from 10us to 1us in commit
6b8aa753a9 ("igc: Decrease PTM short interval from 10 us to 1 us"), which
said:

  With the 10us interval, we were seeing PTM transactions take around
  12us. Hardware team suggested this interval could be lowered to 1us
  which was confirmed with PCIe sniffer. With the 1us interval, PTM
  dialogs took around 2us.

While a 1us short cycle time was thought to be theoretically sufficient, it
turns out in practice it is not quite long enough. It is unclear if the
problem is in the root port or an issue in i225/i226.

Increase the wait from 1us to 4us. Increasing to 2us appeared to work in
practice on the setups we have available. A value of 4us was chosen due to
the limited hardware available for testing, with a goal of ensuring we wait
long enough without overly penalizing the response time when unnecessary.

The issue can be reproduced with the following:

$ sudo phc2sys -R 1000 -O 0 -i tsn0 -m

Note: 1000 Hz (-R 1000) is unrealistically large, but provides a way to
quickly reproduce the issue.

PHC2SYS exits with:

"ioctl PTP_OFFSET_PRECISE: Connection timed out" when the PTM transaction
  fails

Fixes: 6b8aa753a9 ("igc: Decrease PTM short interval from 10 us to 1 us")
Reviewed-by: Michal Swiatkowski <michal.swiatkowski@linux.intel.com>
Tested-by: Mor Bar-Gabay <morx.bar.gabay@intel.com>
Tested-by: Avigail Dahan <avigailx.dahan@intel.com>
Signed-off-by: Christopher S M Hall <christopher.s.hall@intel.com>
Reviewed-by: Corinna Vinschen <vinschen@redhat.com>
Signed-off-by: Jacob Keller <jacob.e.keller@intel.com>
Acked-by: Vinicius Costa Gomes <vinicius.gomes@intel.com>
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
2025-04-11 09:13:28 -07:00
Christopher S M Hall 8e404ad95d igc: fix PTM cycle trigger logic
Writing to clear the PTM status 'valid' bit while the PTM cycle is
triggered results in unreliable PTM operation. To fix this, clear the
PTM 'trigger' and status after each PTM transaction.

The issue can be reproduced with the following:

$ sudo phc2sys -R 1000 -O 0 -i tsn0 -m

Note: 1000 Hz (-R 1000) is unrealistically large, but provides a way to
quickly reproduce the issue.

PHC2SYS exits with:

"ioctl PTP_OFFSET_PRECISE: Connection timed out" when the PTM transaction
  fails

This patch also fixes a hang in igc_probe() when loading the igc
driver in the kdump kernel on systems supporting PTM.

The igc driver running in the base kernel enables PTM trigger in
igc_probe().  Therefore the driver is always in PTM trigger mode,
except in brief periods when manually triggering a PTM cycle.

When a crash occurs, the NIC is reset while PTM trigger is enabled.
Due to a hardware problem, the NIC is subsequently in a bad busmaster
state and doesn't handle register reads/writes.  When running
igc_probe() in the kdump kernel, the first register access to a NIC
register hangs driver probing and ultimately breaks kdump.

With this patch, igc has PTM trigger disabled most of the time,
and the trigger is only enabled for very brief (10 - 100 us) periods
when manually triggering a PTM cycle.  Chances that a crash occurs
during a PTM trigger are not 0, but extremely reduced.

Fixes: a90ec84837 ("igc: Add support for PTP getcrosststamp()")
Reviewed-by: Michal Swiatkowski <michal.swiatkowski@linux.intel.com>
Tested-by: Mor Bar-Gabay <morx.bar.gabay@intel.com>
Tested-by: Avigail Dahan <avigailx.dahan@intel.com>
Signed-off-by: Christopher S M Hall <christopher.s.hall@intel.com>
Reviewed-by: Corinna Vinschen <vinschen@redhat.com>
Signed-off-by: Jacob Keller <jacob.e.keller@intel.com>
Tested-by: Corinna Vinschen <vinschen@redhat.com>
Acked-by: Vinicius Costa Gomes <vinicius.gomes@intel.com>
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
2025-04-11 09:13:11 -07:00
Thomas Gleixner 8fa7292fee treewide: Switch/rename to timer_delete[_sync]()
timer_delete[_sync]() replaces del_timer[_sync](). Convert the whole tree
over and remove the historical wrapper inlines.

Conversion was done with coccinelle plus manual fixups where necessary.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2025-04-05 10:30:12 +02:00
Emil Tantilov 4c9106f490 idpf: fix adapter NULL pointer dereference on reboot
With SRIOV enabled, idpf ends up calling into idpf_remove() twice.
First via idpf_shutdown() and then again when idpf_remove() calls into
sriov_disable(), because the VF devices use the idpf driver, hence the
same remove routine. When that happens, it is possible for the adapter
to be NULL from the first call to idpf_remove(), leading to a NULL
pointer dereference.

echo 1 > /sys/class/net/<netif>/device/sriov_numvfs
reboot

BUG: kernel NULL pointer dereference, address: 0000000000000020
...
RIP: 0010:idpf_remove+0x22/0x1f0 [idpf]
...
? idpf_remove+0x22/0x1f0 [idpf]
? idpf_remove+0x1e4/0x1f0 [idpf]
pci_device_remove+0x3f/0xb0
device_release_driver_internal+0x19f/0x200
pci_stop_bus_device+0x6d/0x90
pci_stop_and_remove_bus_device+0x12/0x20
pci_iov_remove_virtfn+0xbe/0x120
sriov_disable+0x34/0xe0
idpf_sriov_configure+0x58/0x140 [idpf]
idpf_remove+0x1b9/0x1f0 [idpf]
idpf_shutdown+0x12/0x30 [idpf]
pci_device_shutdown+0x35/0x60
device_shutdown+0x156/0x200
...

Replace the direct idpf_remove() call in idpf_shutdown() with
idpf_vc_core_deinit() and idpf_deinit_dflt_mbx(), which perform
the bulk of the cleanup, such as stopping the init task, freeing IRQs,
destroying the vports and freeing the mailbox. This avoids the calls to
sriov_disable() in addition to a small netdev cleanup, and destroying
workqueues, which don't seem to be required on shutdown.

Reported-by: Yuying Ma <yuma@redhat.com>
Fixes: e850efed5e ("idpf: add module register and probe functionality")
Reviewed-by: Madhu Chittim <madhu.chittim@intel.com>
Signed-off-by: Emil Tantilov <emil.s.tantilov@intel.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Tested-by: Samuel Salin <Samuel.salin@intel.com>
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
2025-04-02 08:51:41 -07:00
Piotr Kwapulinski 40206599be ixgbe: fix media type detection for E610 device
The commit 23c0e5a16b ("ixgbe: Add link management support for E610
device") introduced incorrect media type detection for E610 device. It
reproduces when advertised speed is modified after driver reload. Clear
the previous outdated PHY type high value.

Reproduction steps:
modprobe ixgbe
ethtool -s eth0 advertise 0x1000000000000
modprobe -r ixgbe
modprobe ixgbe
ethtool -s eth0 advertise 0x1000000000000
Result before the fix:
netlink error: link settings update failed
netlink error: Invalid argument
Result after the fix:
No output error

Fixes: 23c0e5a16b ("ixgbe: Add link management support for E610 device")
Reviewed-by: Przemek Kitszel <przemyslaw.kitszel@intel.com>
Reviewed-by: Paul Menzel <pmenzel@molgen.mpg.de>
Signed-off-by: Piotr Kwapulinski <piotr.kwapulinski@intel.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Tested-by: Bharath R <bharath.r@intel.com>
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
2025-04-02 08:51:41 -07:00
Vitaly Lifshits efaaf344bc e1000e: change k1 configuration on MTP and later platforms
Starting from Meteor Lake, the Kumeran interface between the integrated
MAC and the I219 PHY works at a different frequency. This causes sporadic
MDI errors when accessing the PHY, and in rare circumstances could lead
to packet corruption.

To overcome this, introduce minor changes to the Kumeran idle
state (K1) parameters during device initialization. Hardware reset
reverts this configuration, therefore it needs to be applied in a few
places.

Fixes: cc23f4f0b6 ("e1000e: Add support for Meteor Lake")
Signed-off-by: Vitaly Lifshits <vitaly.lifshits@intel.com>
Tested-by: Avigail Dahan <avigailx.dahan@intel.com>
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
2025-04-02 08:51:41 -07:00
Zdenek Bouska d931cf9b38 igc: Fix TX drops in XDP ZC
Fixes TX frame drops in AF_XDP zero copy mode when budget < 4.
xsk_tx_peek_desc() consumed TX frame and it was ignored because of
low budget. Not even AF_XDP completion was done for dropped frames.

It can be reproduced on i226 by sending 100000x 60 B frames with
launch time set to minimal IPG (672 ns between starts of frames)
on 1Gbit/s. Always 1026 frames are not sent and are missing a
completion.

Fixes: 9acf59a752 ("igc: Enable TX via AF_XDP zero-copy")
Signed-off-by: Zdenek Bouska <zdenek.bouska@siemens.com>
Reviewed-by: Song Yoong Siang <yoong.siang.song@intel.com>
Reviewed-by: Florian Bezdeka <florian.bezdeka@siemens.com>
Tested-by: Mor Bar-Gabay <morx.bar.gabay@intel.com>
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
2025-04-02 08:51:41 -07:00
Joe Damato dddeeaa16c igc: Fix XSK queue NAPI ID mapping
In commit b65969856d ("igc: Link queues to NAPI instances"), the XSK
queues were incorrectly unmapped from their NAPI instances. After
discussion on the mailing list and the introduction of a test to codify
the expected behavior, we can see that the unmapping causes the
check_xsk test to fail:

NETIF=enp86s0 ./tools/testing/selftests/drivers/net/queues.py

[...]
  # Check|     ksft_eq(q.get('xsk', None), {},
  # Check failed None != {} xsk attr on queue we configured
  not ok 4 queues.check_xsk

After this commit, the test passes:

  ok 4 queues.check_xsk

Note that the test itself is only in net-next, so I tested this change
by applying it to my local net-next tree, booting, and running the test.

Cc: stable@vger.kernel.org
Fixes: b65969856d ("igc: Link queues to NAPI instances")
Signed-off-by: Joe Damato <jdamato@fastly.com>
Reviewed-by: Gerhard Engleder <gerhard@engleder-embedded.com>
Tested-by: Mor Bar-Gabay <morx.bar.gabay@intel.com>
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
2025-04-02 08:51:41 -07:00
Linus Torvalds 1a9239bb42 Networking changes for 6.15.
Core & protocols
 ----------------
 
  - Continue Netlink conversions to per-namespace RTNL lock
    (IPv4 routing, routing rules, routing next hops, ARP ioctls).
 
  - Continue extending the use of netdev instance locks. As a driver
    opt-in protect queue operations and (in due course) ethtool
    operations with the instance lock and not RTNL lock.
 
  - Support collecting TCP timestamps (data submitted, sent, acked)
    in BPF, allowing for transparent (to the application) and lower
    overhead tracking of TCP RPC performance.
 
  - Tweak existing networking Rx zero-copy infra to support zero-copy
    Rx via io_uring.
 
  - Optimize MPTCP performance in single subflow mode by 29%.
 
  - Enable GRO on packets which went thru XDP CPU redirect (were queued
    for processing on a different CPU). Improving TCP stream performance
    up to 2x.
 
  - Improve performance of contended connect() by 200% by searching
    for an available 4-tuple under RCU rather than a spin lock.
    Bring an additional 229% improvement by tweaking hash distribution.
 
  - Avoid unconditionally touching sk_tsflags on RX, improving
    performance under UDP flood by as much as 10%.
 
  - Avoid skb_clone() dance in ping_rcv() to improve performance under
    ping flood.
 
  - Avoid FIB lookup in netfilter if socket is available, 20% perf win.
 
  - Rework network device creation (in-kernel) API to more clearly
    identify network namespaces and their roles.
    There are up to 4 namespace roles but we used to have just 2 netns
    pointer arguments, interpreted differently based on context.
 
  - Use sysfs_break_active_protection() instead of trylock to avoid
    deadlocks between unregistering objects and sysfs access.
 
  - Add a new sysctl and sockopt for capping max retransmit timeout
    in TCP.
 
  - Support masking port and DSCP in routing rule matches.
 
  - Support dumping IPv4 multicast addresses with RTM_GETMULTICAST.
 
  - Support specifying at what time packet should be sent on AF_XDP
    sockets.
 
  - Expose TCP ULP diagnostic info (for TLS and MPTCP) to non-admin users.
 
  - Add Netlink YAML spec for WiFi (nl80211) and conntrack.
 
  - Introduce EXPORT_IPV6_MOD() and EXPORT_IPV6_MOD_GPL() for symbols
    which only need to be exported when IPv6 support is built as a module.
 
  - Age FDB entries based on Rx not Tx traffic in VxLAN, similar
    to normal bridging.
 
  - Allow users to specify source port range for GENEVE tunnels.
 
  - netconsole: allow attaching kernel release, CPU ID and task name
    to messages as metadata
 
 Driver API
 ----------
 
  - Continue rework / fixing of Energy Efficient Ethernet (EEE) across
    the SW layers. Delegate the responsibilities to phylink where possible.
    Improve its handling in phylib.
 
  - Support symmetric OR-XOR RSS hashing algorithm.
 
  - Support tracking and preserving IRQ affinity by NAPI itself.
 
  - Support loopback mode speed selection for interface selftests.
 
 Device drivers
 --------------
 
  - Remove the IBM LCS driver for s390.
 
  - Remove the sb1000 cable modem driver.
 
  - Add support for SFP module access over SMBus.
 
  - Add MCTP transport driver for MCTP-over-USB.
 
  - Enable XDP metadata support in multiple drivers.
 
  - Ethernet high-speed NICs:
    - Broadcom (bnxt):
      - add PCIe TLP Processing Hints (TPH) support for new AMD platforms
      - support dumping RoCE queue state for debug
      - opt into instance locking
    - Intel (100G, ice, idpf):
      - ice: rework MSI-X IRQ management and distribution
      - ice: support for E830 devices
      - iavf: add support for Rx timestamping
      - iavf: opt into instance locking
    - nVidia/Mellanox:
      - mlx4: use page pool memory allocator for Rx
      - mlx5: support for one PTP device per hardware clock
      - mlx5: support for 200Gbps per-lane link modes
      - mlx5: move IPSec policy check after decryption
    - AMD/Solarflare:
      - support FW flashing via devlink
    - Cisco (enic):
      - use page pool memory allocator for Rx
      - enable 32, 64 byte CQEs
      - get max rx/tx ring size from the device
    - Meta (fbnic):
      - support flow steering and RSS configuration
      - report queue stats
      - support TCP segmentation
      - support IRQ coalescing
      - support ring size configuration
    - Marvell/Cavium:
      - support AF_XDP
    - Wangxun:
      - support for PTP clock and timestamping
    - Huawei (hibmcge):
      - checksum offload
      - add more statistics
 
  - Ethernet virtual:
    - VirtIO net:
      - aggressively suppress Tx completions, improve perf by 96% with
        1 CPU and 55% with 2 CPUs
      - expose NAPI to IRQ mapping and persist NAPI settings
    - Google (gve):
      - support XDP in DQO RDA Queue Format
      - opt into instance locking
    - Microsoft vNIC:
      - support BIG TCP
 
  - Ethernet NICs consumer, and embedded:
    - Synopsys (stmmac):
      - cleanup Tx and Tx clock setting and other link-focused cleanups
      - enable SGMII and 2500BASEX mode switching for Intel platforms
      - support Sophgo SG2044
    - Broadcom switches (b53):
      - support for BCM53101
    - TI:
      - iep: add perout configuration support
      - icssg: support XDP
    - Cadence (macb):
      - implement BQL
    - Xilinx (axinet):
      - support dynamic IRQ moderation and changing coalescing at runtime
      - implement BQL
      - report standard stats
    - MediaTek:
      - support phylink managed EEE
    - Intel:
      - igc: don't restart the interface on every XDP program change
    - RealTek (r8169):
      - support reading registers of internal PHYs directly
      - increase max jumbo packet size on RTL8125/RTL8126
    - Airoha:
      - support for RISC-V NPU packet processing unit
      - enable scatter-gather and support MTU up to 9kB
    - Tehuti (tn40xx):
      - support cards with TN4010 MAC and an Aquantia AQR105 PHY
 
  - Ethernet PHYs:
    - support for TJA1102S, TJA1121
    - dp83tg720: add randomized polling intervals for link detection
    - dp83822: support changing the transmit amplitude voltage
    - support for LEDs on 88q2xxx
 
  - CAN:
    - canxl: support Remote Request Substitution bit access
    - flexcan: add S32G2/S32G3 SoC
 
  - WiFi:
    - remove cooked monitor support
    - strict mode for better AP testing
    - basic EPCS support
    - OMI RX bandwidth reduction support
    - batman-adv: add support for jumbo frames
 
  - WiFi drivers:
    - RealTek (rtw88):
      - support RTL8814AE and RTL8814AU
    - RealTek (rtw89):
      - switch using wiphy_lock and wiphy_work
      - add BB context to manipulate two PHY as preparation of MLO
      - improve BT-coexistence mechanism to play A2DP smoothly
    - Intel (iwlwifi):
      - add new iwlmld sub-driver for latest HW/FW combinations
    - MediaTek (mt76):
      - preparation for mt7996 Multi-Link Operation (MLO) support
    - Qualcomm/Atheros (ath12k):
      - continued work on MLO
    - Silabs (wfx):
      - Wake-on-WLAN support
 
  - Bluetooth:
    - add support for skb TX SND/COMPLETION timestamping
    - hci_core: enable buffer flow control for SCO/eSCO
    - coredump: log devcd dumps into the monitor
 
  - Bluetooth drivers:
    - intel: add support to configure TX power
    - nxp: handle bootloader error during cmd5 and cmd7
 
 Signed-off-by: Jakub Kicinski <kuba@kernel.org>
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCAAdFiEE6jPA+I1ugmIBA4hXMUZtbf5SIrsFAmfkLC8ACgkQMUZtbf5S
 Irsb5g/+L7oKOf0ALbaV9kxFsoz8AymZfAW9i/27F07omGJGpks8oX6j6rQLgIRO
 OQOFcp7XEdDh1+jh82gHVuPrw2/6lchLtW8ARtzdiQKFr5DRjrsbtua6GRc8iBqA
 DIRCBFoV2HuMkF39Vr09HMa9AZAT7QR2RLsRGpSq8E8Z8xxKz0X7oujs10PFpMTE
 IVKhTrVrk+NDot/IU2hzVpnpup+0ld+T2/ZaBklJGcU8uDffImsqNepHRyCG5UC3
 xz74Ju23MAj24Gct+og0yFUooF+lUltKyVm0FYCDCY3bASTwgY01NR3kEH/0NQvM
 cywLzd/ngHm/SMD2ggVAHkjZUieiIVHdaZ53dgjDeBOQoVP6p0dgUK7EumXX8Mx4
 8ReR2UiGoYRPaq9c4o+IjG4K027MwVK2p+mF1a6MLa+20XcyMbev8FIRbbHtC/V4
 z5/FsOAxcuICWkA1hU9bODrrGzIqemmdRgKG8sGuTJCt/kYGAn72/TCATGNSaCJ0
 00n2jN1aepa7wtywHJ5MhVzxN9iQX7+geUHXz0BI+lK4e1Pmk+vjGksymb9ai2fk
 eQAUV9ekub6q68/J16scD7XeOUM37bTLiMBQeIF8UtZBOJscKiS71zn9QP9Twwxv
 P2pm01RDZUI+z5ZX3hc12Pm1vjRHaAh9S1JpAw/pTOVlQ+mAJEM=
 =XY0S
 -----END PGP SIGNATURE-----

Merge tag 'net-next-6.15' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next

Pull networking updates from Jakub Kicinski:
 "Core & protocols:

   - Continue Netlink conversions to per-namespace RTNL lock
     (IPv4 routing, routing rules, routing next hops, ARP ioctls)

   - Continue extending the use of netdev instance locks. As a driver
     opt-in protect queue operations and (in due course) ethtool
     operations with the instance lock and not RTNL lock.

   - Support collecting TCP timestamps (data submitted, sent, acked) in
     BPF, allowing for transparent (to the application) and lower
     overhead tracking of TCP RPC performance.

   - Tweak existing networking Rx zero-copy infra to support zero-copy
     Rx via io_uring.

   - Optimize MPTCP performance in single subflow mode by 29%.

   - Enable GRO on packets which went thru XDP CPU redirect (were queued
     for processing on a different CPU). Improving TCP stream
     performance up to 2x.

   - Improve performance of contended connect() by 200% by searching for
     an available 4-tuple under RCU rather than a spin lock. Bring an
     additional 229% improvement by tweaking hash distribution.

   - Avoid unconditionally touching sk_tsflags on RX, improving
     performance under UDP flood by as much as 10%.

   - Avoid skb_clone() dance in ping_rcv() to improve performance under
     ping flood.

   - Avoid FIB lookup in netfilter if socket is available, 20% perf win.

   - Rework network device creation (in-kernel) API to more clearly
     identify network namespaces and their roles. There are up to 4
     namespace roles but we used to have just 2 netns pointer arguments,
     interpreted differently based on context.

   - Use sysfs_break_active_protection() instead of trylock to avoid
     deadlocks between unregistering objects and sysfs access.

   - Add a new sysctl and sockopt for capping max retransmit timeout in
     TCP.

   - Support masking port and DSCP in routing rule matches.

   - Support dumping IPv4 multicast addresses with RTM_GETMULTICAST.

   - Support specifying at what time packet should be sent on AF_XDP
     sockets.

   - Expose TCP ULP diagnostic info (for TLS and MPTCP) to non-admin
     users.

   - Add Netlink YAML spec for WiFi (nl80211) and conntrack.

   - Introduce EXPORT_IPV6_MOD() and EXPORT_IPV6_MOD_GPL() for symbols
     which only need to be exported when IPv6 support is built as a
     module.

   - Age FDB entries based on Rx not Tx traffic in VxLAN, similar to
     normal bridging.

   - Allow users to specify source port range for GENEVE tunnels.

   - netconsole: allow attaching kernel release, CPU ID and task name to
     messages as metadata

  Driver API:

   - Continue rework / fixing of Energy Efficient Ethernet (EEE) across
     the SW layers. Delegate the responsibilities to phylink where
     possible. Improve its handling in phylib.

   - Support symmetric OR-XOR RSS hashing algorithm.

   - Support tracking and preserving IRQ affinity by NAPI itself.

   - Support loopback mode speed selection for interface selftests.

  Device drivers:

   - Remove the IBM LCS driver for s390

   - Remove the sb1000 cable modem driver

   - Add support for SFP module access over SMBus

   - Add MCTP transport driver for MCTP-over-USB

   - Enable XDP metadata support in multiple drivers

   - Ethernet high-speed NICs:
      - Broadcom (bnxt):
         - add PCIe TLP Processing Hints (TPH) support for new AMD
           platforms
         - support dumping RoCE queue state for debug
         - opt into instance locking
      - Intel (100G, ice, idpf):
         - ice: rework MSI-X IRQ management and distribution
         - ice: support for E830 devices
         - iavf: add support for Rx timestamping
         - iavf: opt into instance locking
      - nVidia/Mellanox:
         - mlx4: use page pool memory allocator for Rx
         - mlx5: support for one PTP device per hardware clock
         - mlx5: support for 200Gbps per-lane link modes
         - mlx5: move IPSec policy check after decryption
      - AMD/Solarflare:
         - support FW flashing via devlink
      - Cisco (enic):
         - use page pool memory allocator for Rx
         - enable 32, 64 byte CQEs
         - get max rx/tx ring size from the device
      - Meta (fbnic):
         - support flow steering and RSS configuration
         - report queue stats
         - support TCP segmentation
         - support IRQ coalescing
         - support ring size configuration
      - Marvell/Cavium:
         - support AF_XDP
      - Wangxun:
         - support for PTP clock and timestamping
      - Huawei (hibmcge):
         - checksum offload
         - add more statistics

   - Ethernet virtual:
      - VirtIO net:
         - aggressively suppress Tx completions, improve perf by 96%
           with 1 CPU and 55% with 2 CPUs
         - expose NAPI to IRQ mapping and persist NAPI settings
      - Google (gve):
         - support XDP in DQO RDA Queue Format
         - opt into instance locking
      - Microsoft vNIC:
         - support BIG TCP

   - Ethernet NICs consumer, and embedded:
      - Synopsys (stmmac):
         - cleanup Tx and Tx clock setting and other link-focused
           cleanups
         - enable SGMII and 2500BASEX mode switching for Intel platforms
         - support Sophgo SG2044
      - Broadcom switches (b53):
         - support for BCM53101
      - TI:
         - iep: add perout configuration support
         - icssg: support XDP
      - Cadence (macb):
         - implement BQL
      - Xilinx (axinet):
         - support dynamic IRQ moderation and changing coalescing at
           runtime
         - implement BQL
         - report standard stats
      - MediaTek:
         - support phylink managed EEE
      - Intel:
         - igc: don't restart the interface on every XDP program change
      - RealTek (r8169):
         - support reading registers of internal PHYs directly
         - increase max jumbo packet size on RTL8125/RTL8126
      - Airoha:
         - support for RISC-V NPU packet processing unit
         - enable scatter-gather and support MTU up to 9kB
      - Tehuti (tn40xx):
         - support cards with TN4010 MAC and an Aquantia AQR105 PHY

   - Ethernet PHYs:
      - support for TJA1102S, TJA1121
      - dp83tg720: add randomized polling intervals for link detection
      - dp83822: support changing the transmit amplitude voltage
      - support for LEDs on 88q2xxx

   - CAN:
      - canxl: support Remote Request Substitution bit access
      - flexcan: add S32G2/S32G3 SoC

   - WiFi:
      - remove cooked monitor support
      - strict mode for better AP testing
      - basic EPCS support
      - OMI RX bandwidth reduction support
      - batman-adv: add support for jumbo frames

   - WiFi drivers:
      - RealTek (rtw88):
         - support RTL8814AE and RTL8814AU
      - RealTek (rtw89):
         - switch using wiphy_lock and wiphy_work
         - add BB context to manipulate two PHY as preparation of MLO
         - improve BT-coexistence mechanism to play A2DP smoothly
      - Intel (iwlwifi):
         - add new iwlmld sub-driver for latest HW/FW combinations
      - MediaTek (mt76):
         - preparation for mt7996 Multi-Link Operation (MLO) support
      - Qualcomm/Atheros (ath12k):
         - continued work on MLO
      - Silabs (wfx):
         - Wake-on-WLAN support

   - Bluetooth:
      - add support for skb TX SND/COMPLETION timestamping
      - hci_core: enable buffer flow control for SCO/eSCO
      - coredump: log devcd dumps into the monitor

   - Bluetooth drivers:
      - intel: add support to configure TX power
      - nxp: handle bootloader error during cmd5 and cmd7"

* tag 'net-next-6.15' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next: (1681 commits)
  unix: fix up for "apparmor: add fine grained af_unix mediation"
  mctp: Fix incorrect tx flow invalidation condition in mctp-i2c
  net: usb: asix: ax88772: Increase phy_name size
  net: phy: Introduce PHY_ID_SIZE — minimum size for PHY ID string
  net: libwx: fix Tx L4 checksum
  net: libwx: fix Tx descriptor content for some tunnel packets
  atm: Fix NULL pointer dereference
  net: tn40xx: add pci-id of the aqr105-based Tehuti TN4010 cards
  net: tn40xx: prepare tn40xx driver to find phy of the TN9510 card
  net: tn40xx: create swnode for mdio and aqr105 phy and add to mdiobus
  net: phy: aquantia: add essential functions to aqr105 driver
  net: phy: aquantia: search for firmware-name in fwnode
  net: phy: aquantia: add probe function to aqr105 for firmware loading
  net: phy: Add swnode support to mdiobus_scan
  gve: add XDP DROP and PASS support for DQ
  gve: update XDP allocation path support RX buffer posting
  gve: merge packet buffer size fields
  gve: update GQ RX to use buf_size
  gve: introduce config-based allocation for XDP
  gve: remove xdp_xsk_done and xdp_xsk_wakeup statistics
  ...
2025-03-26 21:48:21 -07:00
Jakub Kicinski 023b1e9d26 Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Merge in late fixes to prepare for the 6.15 net-next PR.

No conflicts, adjacent changes:

drivers/net/ethernet/broadcom/bnxt/bnxt.c
  919f9f497d ("eth: bnxt: fix out-of-range access of vnic_info array")
  fe96d717d3 ("bnxt_en: Extend queue stop/start for TX rings")

Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-03-26 09:32:10 -07:00
Linus Torvalds a50b4fe095 A treewide hrtimer timer cleanup
hrtimers are initialized with hrtimer_init() and a subsequent store to
   the callback pointer. This turned out to be suboptimal for the upcoming
   Rust integration and is obviously a silly implementation to begin with.
 
   This cleanup replaces the hrtimer_init(T); T->function = cb; sequence
   with hrtimer_setup(T, cb);
 
   The conversion was done with Coccinelle and a few manual fixups.
 
   Once the conversion has completely landed in mainline, hrtimer_init()
   will be removed and the hrtimer::function becomes a private member.
 -----BEGIN PGP SIGNATURE-----
 
 iQJHBAABCgAxFiEEQp8+kY+LLUocC4bMphj1TA10mKEFAmff5jQTHHRnbHhAbGlu
 dXRyb25peC5kZQAKCRCmGPVMDXSYoVvRD/wKtuwmiA66NJFgXC0qVq82A6fO3bY8
 GBdbfysDJIbqGu5PTcULTbJ8qkqv3jeLUv6CcXvS4sZ7y/uJQl2lzf8yrD/0bbwc
 rLI6sHiPSZmK93kNVN4X5H7kvt7cE/DYC9nnEOgK3BY5FgKc4n9887d4aVBhL8Lv
 ODwVXvZ+xi351YCj7qRyPU24zt/p4tkkT1o2k4a0HBluqLI0D+V20fke9IERUL8r
 d1uWKlcn0TqYDesE8HXKIhbst3gx52rMJrXBJDHwFmG6v8Pj1fkTXCVpPo8QcBz8
 OTVkpomN9f/Tx4+GZwhZOF86LhLL3OhxD6pT7JhFCXdmSGv+Ez8uyk1YZysM/XpV
 Juy/1yAcBpDIDkmhMFGdAAn48Nn9Fotty0r4je60zSEp1d/4QMXcFme29qr2JTUE
 iWnQ/HD6DxUjVHqy7CYvvo26Xegg1C7qgyOVt4PYZwAM1VKF5P3kzYTb4SAdxtop
 Tpji1sfW9QV08jqMNo6XntD32DSP9S2HqjO9LwBw700jnx2jjJ35fcJs6iodMOUn
 gckIZLMn3L0OoglPdyA5O7SNTbKE7aFiRKdnT/cJtR3Fa39Qu27CwC5gfiyuie9I
 Q+LG8GLuYSBHXAR+PBK4GWlzJ7Dn8k3eqmbnLeKpRMsU6ZzcttgA64xhaviN2wN0
 iJbvLJeisXr3GA==
 =bYAX
 -----END PGP SIGNATURE-----

Merge tag 'timers-cleanups-2025-03-23' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull timer cleanups from Thomas Gleixner:
 "A treewide hrtimer timer cleanup

  hrtimers are initialized with hrtimer_init() and a subsequent store to
  the callback pointer. This turned out to be suboptimal for the
  upcoming Rust integration and is obviously a silly implementation to
  begin with.

  This cleanup replaces the hrtimer_init(T); T->function = cb; sequence
  with hrtimer_setup(T, cb);

  The conversion was done with Coccinelle and a few manual fixups.

  Once the conversion has completely landed in mainline, hrtimer_init()
  will be removed and the hrtimer::function becomes a private member"

* tag 'timers-cleanups-2025-03-23' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (100 commits)
  wifi: rt2x00: Switch to use hrtimer_update_function()
  io_uring: Use helper function hrtimer_update_function()
  serial: xilinx_uartps: Use helper function hrtimer_update_function()
  ASoC: fsl: imx-pcm-fiq: Switch to use hrtimer_setup()
  RDMA: Switch to use hrtimer_setup()
  virtio: mem: Switch to use hrtimer_setup()
  drm/vmwgfx: Switch to use hrtimer_setup()
  drm/xe/oa: Switch to use hrtimer_setup()
  drm/vkms: Switch to use hrtimer_setup()
  drm/msm: Switch to use hrtimer_setup()
  drm/i915/request: Switch to use hrtimer_setup()
  drm/i915/uncore: Switch to use hrtimer_setup()
  drm/i915/pmu: Switch to use hrtimer_setup()
  drm/i915/perf: Switch to use hrtimer_setup()
  drm/i915/gvt: Switch to use hrtimer_setup()
  drm/i915/huc: Switch to use hrtimer_setup()
  drm/amdgpu: Switch to use hrtimer_setup()
  stm class: heartbeat: Switch to use hrtimer_setup()
  i2c: Switch to use hrtimer_setup()
  iio: Switch to use hrtimer_setup()
  ...
2025-03-25 10:54:15 -07:00
Jakub Kicinski 586b7b3ebb ipsec-next-2025-03-24
-----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEEH7ZpcWbFyOOp6OJbrB3Eaf9PW7cFAmfg9oUACgkQrB3Eaf9P
 W7d8eRAAhUdQztUoVjNfRBScD34EHEBq8ruFMgbHcXkkdhg223CUaTYKqz3dYE1E
 04OCSwe7yxFFSs7CKR4dRMSzcx7uH/ISPprk45g+wwoIzBCYOWjLBzS5qmamtSYu
 E3sI/QZaVUU7mFDg5n3sr5nkCNz+LtTL2dUbOgi5AOqY9h7LZRzLBEB1RCVOioVO
 14vDRBJ5RGIvmNx3pd0sRhB4BySUw1GQVTYLSFaL/V2JZcRu3yGhKT8q637fdKcS
 blvdf5q4CYyu5i3p9LHhsg33D8fYUotBVFtzw3QaM1G9MdyhMF4WUbeZCa2kkk7J
 N7j2N0WDjP7crKQkPbm9ATQiruF2zvPcYJqgIHKuNhWY35KNIFARWxbP8KGJJFUI
 EqUUKuT0lUi39hULmm5YEqCGWLNQldTBOOZL1fccRj2DZelx4uhQsnt5B2yimUCD
 7QJGH0Vx6NJAutuCgTXQtmD0qO2XYi50XLmh4K+HZ1M3DkSGohcYSz+aj8IIY/wT
 +yT55uRoONqDbh9kOLHab0WWE6R+XJyY5rKnz6XpzyTPh0OwoxrbVXJF045NYIpB
 tSl9ykVpymvWNT56nsfXEOm5THZhdvL+uDS/QV4rtF9En4TEJU0dGbhESvriuu2j
 g6EAQpYJqL9OwA5aKVZCGlHf7siji7Z/uc57DADZ8mtmDXVMqco=
 =9RMg
 -----END PGP SIGNATURE-----

Merge tag 'ipsec-next-2025-03-24' of git://git.kernel.org/pub/scm/linux/kernel/git/klassert/ipsec-next

Steffen Klassert says:

====================
pull request (net-next): ipsec-next 2025-03-24

1) Prevent setting high order sequence number bits input in
   non-ESN mode. From Leon Romanovsky.

2) Support PMTU handling in tunnel mode for packet offload.
   From Leon Romanovsky.

3) Make xfrm_state_lookup_byaddr lockless.
   From Florian Westphal.

4) Remove unnecessary NULL check in xfrm_lookup_with_ifid().
   From Dan Carpenter.

* tag 'ipsec-next-2025-03-24' of git://git.kernel.org/pub/scm/linux/kernel/git/klassert/ipsec-next:
  xfrm: Remove unnecessary NULL check in xfrm_lookup_with_ifid()
  xfrm: state: make xfrm_state_lookup_byaddr lockless
  xfrm: check for PMTU in tunnel mode for packet offload
  xfrm: provide common xdo_dev_offload_ok callback implementation
  xfrm: rely on XFRM offload
  xfrm: simplify SA initialization routine
  xfrm: delay initialization of offload path till its actually requested
  xfrm: prevent high SEQ input in non-ESN mode
====================

Link: https://patch.msgid.link/20250324061855.4116819-1-steffen.klassert@secunet.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-03-25 08:50:10 -07:00
Jacob Keller 5eada2aabf igb: reject invalid external timestamp requests for 82580-based HW
The igb_ptp_feature_enable_82580 function correctly checks that unknown
flags are not passed to the function. However, it does not actually check
PTP_RISING_EDGE or PTP_FALLING_EDGE when configuring the external timestamp
function.

The data sheet for the 82580 product says:

  Upon a change in the input level of one of the SDP pins that was
  configured to detect Time stamp events using the TSSDP register, a time
  stamp of the system time is captured into one of the two auxiliary time
  stamp registers (AUXSTMPL/H0 or AUXSTMPL/H1).

  For example to define timestamping of events in the AUXSTMPL0 and
  AUXSTMPH0 registers, Software should:

  1. Set the TSSDP.AUX0_SDP_SEL field to select the SDP pin that detects
     the level change and set the TSSDP.AUX0_TS_SDP_EN bit to 1.

  2. Set the TSAUXC.EN_TS0 bit to 1 to enable timestamping

The same paragraph is in the i350 and i354 data sheets.

The wording implies that the time stamps are captured at any level change.
There does not appear to be any way to only timestamp one edge of the
signal.

Reject requests which do not set both PTP_RISING_EDGE and PTP_FALLING_EDGE
when operating under PTP_STRICT_FLAGS mode via PTP_EXTTS_REQUEST2.

Fixes: 38970eac41 ("igb: support EXTTS on 82580/i354/i350")
Reviewed-by: Michal Swiatkowski <michal.swiatkowski@linux.intel.com>
Signed-off-by: Jacob Keller <jacob.e.keller@intel.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Link: https://patch.msgid.link/20250312-jk-net-fixes-supported-extts-flags-v2-1-ea930ba82459@intel.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2025-03-20 08:50:02 +01:00
Emil Tantilov 680811c679 idpf: check error for register_netdev() on init
Current init logic ignores the error code from register_netdev(),
which will cause WARN_ON() on attempt to unregister it, if there was one,
and there is no info for the user that the creation of the netdev failed.

WARNING: CPU: 89 PID: 6902 at net/core/dev.c:11512 unregister_netdevice_many_notify+0x211/0x1a10
...
[ 3707.563641]  unregister_netdev+0x1c/0x30
[ 3707.563656]  idpf_vport_dealloc+0x5cf/0xce0 [idpf]
[ 3707.563684]  idpf_deinit_task+0xef/0x160 [idpf]
[ 3707.563712]  idpf_vc_core_deinit+0x84/0x320 [idpf]
[ 3707.563739]  idpf_remove+0xbf/0x780 [idpf]
[ 3707.563769]  pci_device_remove+0xab/0x1e0
[ 3707.563786]  device_release_driver_internal+0x371/0x530
[ 3707.563803]  driver_detach+0xbf/0x180
[ 3707.563816]  bus_remove_driver+0x11b/0x2a0
[ 3707.563829]  pci_unregister_driver+0x2a/0x250

Introduce an error check and log the vport number and error code.
On removal make sure to check VPORT_REG_NETDEV flag prior to calling
unregister and free on the netdev.

Add local variables for idx, vport_config and netdev for readability.

Fixes: 0fe45467a1 ("idpf: add create vport and netdev configuration")
Suggested-by: Tony Nguyen <anthony.l.nguyen@intel.com>
Signed-off-by: Emil Tantilov <emil.s.tantilov@intel.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Tested-by: Samuel Salin <Samuel.salin@intel.com>
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
2025-03-18 09:38:16 -07:00
Mateusz Polchlopek 1388dd5641 ice: fix using untrusted value of pkt_len in ice_vc_fdir_parse_raw()
Fix using the untrusted value of proto->raw.pkt_len in function
ice_vc_fdir_parse_raw() by verifying if it does not exceed the
VIRTCHNL_MAX_SIZE_RAW_PACKET value.

Fixes: 99f419df8a ("ice: enable FDIR filters from raw binary patterns for VFs")
Reviewed-by: Przemek Kitszel <przemyslaw.kitszel@intel.com>
Signed-off-by: Mateusz Polchlopek <mateusz.polchlopek@intel.com>
Signed-off-by: Martyna Szapar-Mudlaw <martyna.szapar-mudlaw@linux.intel.com>
Tested-by: Rafal Romanowski <rafal.romanowski@intel.com>
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
2025-03-18 09:38:16 -07:00
Lukasz Czapnik c5be6562de ice: fix input validation for virtchnl BW
Add missing validation of tc and queue id values sent by a VF in
ice_vc_cfg_q_bw().
Additionally fixed logged value in the warning message,
where max_tx_rate was incorrectly referenced instead of min_tx_rate.
Also correct error handling in this function by properly exiting
when invalid configuration is detected.

Fixes: 015307754a ("ice: Support VF queue rate limit and quanta size configuration")
Reviewed-by: Jedrzej Jagielski <jedrzej.jagielski@intel.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Signed-off-by: Lukasz Czapnik <lukasz.czapnik@intel.com>
Co-developed-by: Martyna Szapar-Mudlaw <martyna.szapar-mudlaw@linux.intel.com>
Signed-off-by: Martyna Szapar-Mudlaw <martyna.szapar-mudlaw@linux.intel.com>
Tested-by: Rafal Romanowski <rafal.romanowski@intel.com>
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
2025-03-18 09:38:16 -07:00
Jan Glaza e2f7d3f733 ice: validate queue quanta parameters to prevent OOB access
Add queue wraparound prevention in quanta configuration.
Ensure end_qid does not overflow by validating start_qid and num_queues.

Fixes: 015307754a ("ice: Support VF queue rate limit and quanta size configuration")
Reviewed-by: Jedrzej Jagielski <jedrzej.jagielski@intel.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Signed-off-by: Jan Glaza <jan.glaza@intel.com>
Signed-off-by: Martyna Szapar-Mudlaw <martyna.szapar-mudlaw@linux.intel.com>
Tested-by: Rafal Romanowski <rafal.romanowski@intel.com>
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
2025-03-18 09:38:16 -07:00
Jan Glaza f91d0efcc3 ice: stop truncating queue ids when checking
Queue IDs can be up to 4096, fix invalid check to stop
truncating IDs to 8 bits.

Fixes: bf93bf791c ("ice: introduce ice_virtchnl.c and ice_virtchnl.h")
Reviewed-by: Aleksandr Loktionov <aleksandr.loktionov@intel.com>
Reviewed-by: Jedrzej Jagielski <jedrzej.jagielski@intel.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Signed-off-by: Jan Glaza <jan.glaza@intel.com>
Signed-off-by: Martyna Szapar-Mudlaw <martyna.szapar-mudlaw@linux.intel.com>
Tested-by: Rafal Romanowski <rafal.romanowski@intel.com>
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
2025-03-18 09:38:16 -07:00
Jesse Brandeburg 7fd71f3172 ice: fix reservation of resources for RDMA when disabled
If the CONFIG_INFINIBAND_IRDMA symbol is not enabled as a module or a
built-in, then don't let the driver reserve resources for RDMA. The result
of this change is a large savings in resources for older kernels, and a
cleaner driver configuration for the IRDMA=n case for old and new kernels.

Implement this by avoiding enabling the RDMA capability when scanning
hardware capabilities.

Note: Loading the out-of-tree irdma driver in connection to the in-kernel
ice driver, is not supported, and should not be attempted, especially when
disabling IRDMA in the kernel config.

Fixes: d25a0fc41c ("ice: Initialize RDMA support")
Signed-off-by: Jesse Brandeburg <jbrandeburg@cloudflare.com>
Acked-by: Dave Ertman <david.m.ertman@intel.com>
Reviewed-by: Aleksandr Loktionov <aleksandr.loktionov@intel.com>
Tested-by: Rinitha S <sx.rinitha@intel.com> (A Contingent worker at Intel)
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
2025-03-18 09:38:15 -07:00
Karol Kolacinski 53ce7166cb ice: ensure periodic output start time is in the future
On E800 series hardware, if the start time for a periodic output signal is
programmed into GLTSYN_TGT_H and GLTSYN_TGT_L registers, the hardware logic
locks up and the periodic output signal never starts. Any future attempt to
reprogram the clock function is futile as the hardware will not reset until
a power on.

The ice_ptp_cfg_perout function has logic to prevent this, as it checks if
the requested start time is in the past. If so, a new start time is
calculated by rounding up.

Since commit d755a7e129 ("ice: Cache perout/extts requests and check
flags"), the rounding is done to the nearest multiple of the clock period,
rather than to a full second. This is more accurate, since it ensures the
signal matches the user request precisely.

Unfortunately, there is a race condition with this rounding logic. If the
current time is close to the multiple of the period, we could calculate a
target time that is extremely soon. It takes time for the software to
program the registers, during which time this requested start time could
become a start time in the past. If that happens, the periodic output
signal will lock up.

For large enough periods, or for the logic prior to the mentioned commit,
this is unlikely. However, with the new logic rounding to the period and
with a small enough period, this becomes inevitable.

For example, attempting to enable a 10MHz signal requires a period of 100
nanoseconds. This means in the *best* case, we have 99 nanoseconds to
program the clock output. This is essentially impossible, and thus such a
small period practically guarantees that the clock output function will
lock up.

To fix this, add some slop to the clock time used to check if the start
time is in the past. Because it is not critical that output signals start
immediately, but it *is* critical that we do not brick the function, 0.5
seconds is selected. This does mean that any requested output will be
delayed by at least 0.5 seconds.

This slop is applied before rounding, so that we always round up to the
nearest multiple of the period that is at least 0.5 seconds in the future,
ensuring a minimum of 0.5 seconds to program the clock output registers.

Finally, to ensure that the hardware registers programming the clock output
complete in a timely manner, add a write flush to the end of
ice_ptp_write_perout. This ensures we don't risk any issue with PCIe
transaction batching.

Strictly speaking, this fixes a race condition all the way back at the
initial implementation of periodic output programming, as it is
theoretically possible to trigger this bug even on the old logic when
always rounding to a full second. However, the window is narrow, and the
code has been refactored heavily since then, making a direct backport not
apply cleanly.

Fixes: d755a7e129 ("ice: Cache perout/extts requests and check flags")
Signed-off-by: Karol Kolacinski <karol.kolacinski@intel.com>
Signed-off-by: Jacob Keller <jacob.e.keller@intel.com>
Tested-by: Rinitha S <sx.rinitha@intel.com> (A Contingent worker at Intel)
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
2025-03-18 09:38:15 -07:00
Przemek Kitszel fa8eda1901 ice: health.c: fix compilation on gcc 7.5
GCC 7 is not as good as GCC 8+ in telling what is a compile-time
const, and thus could be used for static storage.
Fortunately keeping strings as const arrays is enough to make old
gcc happy.

Excerpt from the report:
My GCC is: gcc (Ubuntu 7.5.0-3ubuntu1~18.04) 7.5.0.

  CC [M]  drivers/net/ethernet/stmicro/stmmac/stmmac_mdio.o
drivers/net/ethernet/intel/ice/devlink/health.c:35:3: error: initializer element is not constant
   ice_common_port_solutions, {ice_port_number_label}},
   ^~~~~~~~~~~~~~~~~~~~~~~~~
drivers/net/ethernet/intel/ice/devlink/health.c:35:3: note: (near initialization for 'ice_health_status_lookup[0].solution')
drivers/net/ethernet/intel/ice/devlink/health.c:35:31: error: initializer element is not constant
   ice_common_port_solutions, {ice_port_number_label}},
                               ^~~~~~~~~~~~~~~~~~~~~
drivers/net/ethernet/intel/ice/devlink/health.c:35:31: note: (near initialization for 'ice_health_status_lookup[0].data_label[0]')
drivers/net/ethernet/intel/ice/devlink/health.c:37:46: error: initializer element is not constant
   "Change or replace the module or cable.", {ice_port_number_label}},
                                              ^~~~~~~~~~~~~~~~~~~~~
drivers/net/ethernet/intel/ice/devlink/health.c:37:46: note: (near initialization for 'ice_health_status_lookup[1].data_label[0]')
drivers/net/ethernet/intel/ice/devlink/health.c:39:3: error: initializer element is not constant
   ice_common_port_solutions, {ice_port_number_label}},
   ^~~~~~~~~~~~~~~~~~~~~~~~~

Fixes: 85d6164ec5 ("ice: add fw and port health reporters")
Reported-by: Qiuxu Zhuo <qiuxu.zhuo@intel.com>
Closes: https://lore.kernel.org/netdev/CY8PR11MB7134BF7A46D71E50D25FA7A989F72@CY8PR11MB7134.namprd11.prod.outlook.com
Reviewed-by: Michal Swiatkowski <michal.swiatkowski@linux.intel.com>
Suggested-by: Simon Horman <horms@kernel.org>
Signed-off-by: Przemek Kitszel <przemyslaw.kitszel@intel.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Tested-by: Sunitha Mekala <sunithax.d.mekala@intel.com> (A Contingent worker at Intel)
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
2025-03-18 09:38:15 -07:00
Jedrzej Jagielski affead2d90 ixgbe: add support for thermal sensor event reception
E610 NICs unlike the previous devices utilising ixgbe driver
are notified in the case of overheating by the FW ACI event.

In event of overheat when threshold is exceeded, FW suspends all
traffic and sends overtemp event to the driver. Then driver
logs appropriate message and disables the adapter instance.
The card remains in that state until the platform is rebooted.

This approach is a solution to the fact current version of the
E610 FW doesn't support reading thermal sensor data by the
SW. So give to user at least any info that overtemp event
has occurred, without interface disappearing from the OS
without any note.

Reviewed-by: Przemek Kitszel <przemyslaw.kitszel@intel.com>
Reviewed-by: Mateusz Polchlopek <mateusz.polchlopek@intel.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Signed-off-by: Jedrzej Jagielski <jedrzej.jagielski@intel.com>
Tested-by: Jeremiah Lokan <jeremiahx.j.lokan@intel.com> (A Contingent worker at Intel)
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
Link: https://patch.msgid.link/20250310174502.3708121-7-anthony.l.nguyen@intel.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2025-03-18 10:15:50 +01:00
Piotr Kwapulinski 18a9b8e358 ixgbe: add PTP support for E610 device
Add PTP support for E610 adapter. The E610 is based on X550 and adds
firmware managed link, enhanced security capabilities and support for
updated server manageability. It does not introduce any new PTP features
compared to X550.

Reviewed-by: Milena Olech <milena.olech@intel.com>
Reviewed-by: Przemek Kitszel <przemyslaw.kitszel@intel.com>
Signed-off-by: Piotr Kwapulinski <piotr.kwapulinski@intel.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Reviewed-by: Vadim Fedorenko <vadim.fedorenko@linux.dev>
Tested-by: Bharath R <bharath.r@intel.com>
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
Link: https://patch.msgid.link/20250310174502.3708121-6-anthony.l.nguyen@intel.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2025-03-18 10:15:49 +01:00
Karol Kolacinski 50f4ffac91 ice: E825C PHY register cleanup
Minor PTP register refactor, including logical grouping E825C 1-step
timestamping registers. Remove unused register definitions
(PHY_REG_GPCS_BITSLIP, PHY_REG_REVISION).
Also, apply preferred GENMASK macro (instead of ICE_M) for register
fields definition affected by this patch.

Reviewed-by: Simon Horman <horms@kernel.org>
Reviewed-by: Przemek Kitszel <przemyslaw.kitszel@intel.com>
Signed-off-by: Karol Kolacinski <karol.kolacinski@intel.com>
Signed-off-by: Grzegorz Nitka <grzegorz.nitka@intel.com>
Tested-by: Rinitha S <sx.rinitha@intel.com> (A Contingent worker at Intel)
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
Link: https://patch.msgid.link/20250310174502.3708121-5-anthony.l.nguyen@intel.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2025-03-18 10:15:49 +01:00
Karol Kolacinski 66a1b7e09f ice: Refactor E825C PHY registers info struct
Simplify ice_phy_reg_info_eth56g struct definition to include base
address for the very first quad. Use base address info and 'step'
value to determine address for specific PHY quad.

Reviewed-by: Przemek Kitszel <przemyslaw.kitszel@intel.com>
Signed-off-by: Karol Kolacinski <karol.kolacinski@intel.com>
Signed-off-by: Grzegorz Nitka <grzegorz.nitka@intel.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Tested-by: Rinitha S <sx.rinitha@intel.com> (A Contingent worker at Intel)
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
Link: https://patch.msgid.link/20250310174502.3708121-4-anthony.l.nguyen@intel.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2025-03-18 10:15:49 +01:00
Karol Kolacinski 178edd2633 ice: rename ice_ptp_init_phc_eth56g function
Refactor the code by changing ice_ptp_init_phc_eth56g function
name to ice_ptp_init_phc_e825, to be consistent with the naming pattern
for other devices.

Signed-off-by: Karol Kolacinski <karol.kolacinski@intel.com>
Signed-off-by: Grzegorz Nitka <grzegorz.nitka@intel.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Tested-by: Rinitha S <sx.rinitha@intel.com> (A Contingent worker at Intel)
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
Link: https://patch.msgid.link/20250310174502.3708121-3-anthony.l.nguyen@intel.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2025-03-18 10:15:49 +01:00
Paul Greenwalt 905d1a220e ice: Add E830 checksum offload support
E830 supports raw receive and generic transmit checksum offloads.

Raw receive checksum support is provided by hardware calculating the
checksum over the whole packet, regardless of type. The calculated
checksum is provided to driver in the Rx flex descriptor. Then the driver
assigns the checksum to skb->csum and sets skb->ip_summed to
CHECKSUM_COMPLETE.

Generic transmit checksum support is provided by hardware calculating the
checksum given two offsets: the start offset to begin checksum calculation,
and the offset to insert the calculated checksum in the packet. Support is
advertised to the stack using NETIF_F_HW_CSUM feature.

E830 has the following limitations when both generic transmit checksum
offload and TCP Segmentation Offload (TSO) are enabled:

1. Inner packet header modification is not supported. This restriction
   includes the inability to alter TCP flags, such as the push flag. As a
   result, this limitation can impact the receiver's ability to coalesce
   packets, potentially degrading network throughput.
2. The Maximum Segment Size (MSS) is limited to 1023 bytes, which prevents
   support of Maximum Transmission Unit (MTU) greater than 1063 bytes.

Therefore NETIF_F_HW_CSUM and NETIF_F_ALL_TSO features are mutually
exclusive. NETIF_F_HW_CSUM hardware feature support is indicated but is not
enabled by default. Instead, IP checksums and NETIF_F_ALL_TSO are the
defaults. Enforcement of mutual exclusivity of NETIF_F_HW_CSUM and
NETIF_F_ALL_TSO is done in ice_set_features(). Mutual exclusivity
of IP checksums and NETIF_F_HW_CSUM is handled by netdev_fix_features().

When NETIF_F_HW_CSUM is requested the provided skb->csum_start and
skb->csum_offset are passed to hardware in the Tx context descriptor
generic checksum (GCS) parameters. Hardware calculates the 1's complement
from skb->csum_start to the end of the packet, and inserts the result in
the packet at skb->csum_offset.

Co-developed-by: Alice Michael <alice.michael@intel.com>
Signed-off-by: Alice Michael <alice.michael@intel.com>
Co-developed-by: Eric Joyner <eric.joyner@intel.com>
Signed-off-by: Eric Joyner <eric.joyner@intel.com>
Signed-off-by: Paul Greenwalt <paul.greenwalt@intel.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Tested-by: Rinitha S <sx.rinitha@intel.com> (A Contingent worker at Intel)
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
Link: https://patch.msgid.link/20250310174502.3708121-2-anthony.l.nguyen@intel.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2025-03-18 10:15:49 +01:00
Paolo Abeni 941defcea7 Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Cross-merge networking fixes after downstream PR (net-6.14-rc6).

Conflicts:

tools/testing/selftests/drivers/net/ping.py
  75cc19c8ff ("selftests: drv-net: add xdp cases for ping.py")
  de94e86974 ("selftests: drv-net: store addresses in dict indexed by ipver")
https://lore.kernel.org/netdev/20250311115758.17a1d414@canb.auug.org.au/

net/core/devmem.c
  a70f891e0f ("net: devmem: do not WARN conditionally after netdev_rx_queue_restart()")
  1d22d3060b ("net: drop rtnl_lock for queue_mgmt operations")
https://lore.kernel.org/netdev/20250313114929.43744df1@canb.auug.org.au/

Adjacent changes:

tools/testing/selftests/net/Makefile
  6f50175cca ("selftests: Add IPv6 link-local address generation tests for GRE devices.")
  2e5584e0f9 ("selftests/net: expand cmsg_ipv6.sh with ipv4")

drivers/net/ethernet/broadcom/bnxt/bnxt.c
  661958552e ("eth: bnxt: do not use BNXT_VNIC_NTUPLE unconditionally in queue restart logic")
  fe96d717d3 ("bnxt_en: Extend queue stop/start for TX rings")

Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2025-03-13 23:08:11 +01:00
Jakub Kicinski 8ef890df40 net: move misc netdev_lock flavors to a separate header
Move the more esoteric helpers for netdev instance lock to
a dedicated header. This avoids growing netdevice.h to infinity
and makes rebuilding the kernel much faster (after touching
the header with the helpers).

The main netdev_lock() / netdev_unlock() functions are used
in static inlines in netdevice.h and will probably be used
most commonly, so keep them in netdevice.h.

Acked-by: Stanislav Fomichev <sdf@fomichev.me>
Link: https://patch.msgid.link/20250307183006.2312761-1-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-03-08 09:06:50 -08:00
Stanislav Fomichev c4f0f30b42 net: hold netdev instance lock during nft ndo_setup_tc
Introduce new dev_setup_tc for nft ndo_setup_tc paths.

Reviewed-by: Eric Dumazet <edumazet@google.com>
Cc: Saeed Mahameed <saeed@kernel.org>
Signed-off-by: Stanislav Fomichev <sdf@fomichev.me>
Link: https://patch.msgid.link/20250305163732.2766420-3-sdf@fomichev.me
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-03-06 12:59:43 -08:00
Stanislav Fomichev d4c22ec680 net: hold netdev instance lock during ndo_open/ndo_stop
For the drivers that use shaper API, switch to the mode where
core stack holds the netdev lock. This affects two drivers:

* iavf - already grabs netdev lock in ndo_open/ndo_stop, so mostly
         remove these
* netdevsim - switch to _locked APIs to avoid deadlock

iavf_close diff is a bit confusing, the existing call looks like this:
  iavf_close() {
    netdev_lock()
    ..
    netdev_unlock()
    wait_event_timeout(down_waitqueue)
  }

I change it to the following:
  netdev_lock()
  iavf_close() {
    ..
    netdev_unlock()
    wait_event_timeout(down_waitqueue)
    netdev_lock() // reusing this lock call
  }
  netdev_unlock()

Since I'm reusing existing netdev_lock call, so it looks like I only
add netdev_unlock.

Cc: Saeed Mahameed <saeed@kernel.org>
Signed-off-by: Stanislav Fomichev <sdf@fomichev.me>
Link: https://patch.msgid.link/20250305163732.2766420-2-sdf@fomichev.me
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-03-06 12:59:43 -08:00
Przemek Kitszel 2a3e89a148 ice: register devlink prior to creating health reporters
ice_health_init() was introduced in the commit 2a82874a3b ("ice: add
Tx hang devlink health reporter"). The call to it should have been put
after ice_devlink_register(). It went unnoticed until next reporter by
Konrad, which receives events from FW. FW is reporting all events, also
from prior driver load, and thus it is not unlikely to have something
at the very beginning. And that results in a splat:
[   24.455950]  ? devlink_recover_notify.constprop.0+0x198/0x1b0
[   24.455973]  devlink_health_report+0x5d/0x2a0
[   24.455976]  ? __pfx_ice_health_status_lookup_compare+0x10/0x10 [ice]
[   24.456044]  ice_process_health_status_event+0x1b7/0x200 [ice]

Do the analogous thing for deinit patch.

Fixes: 85d6164ec5 ("ice: add fw and port health reporters")
Reviewed-by: Aleksandr Loktionov <aleksandr.loktionov@intel.com>
Reviewed-by: Michal Swiatkowski <michal.swiatkowski@linux.intel.com>
Reviewed-by: Konrad Knitter <konrad.knitter@intel.com>
Signed-off-by: Przemek Kitszel <przemyslaw.kitszel@intel.com>
Tested-by: Sunitha Mekala <sunithax.d.mekala@intel.com> (A Contingent worker at Intel)
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
2025-03-05 08:45:45 -08:00
Marcin Szycik dce97cb0a3 ice: Fix switchdev slow-path in LAG
Ever since removing switchdev control VSI and using PF for port
representor Tx/Rx, switchdev slow-path has been working improperly after
failover in SR-IOV LAG. LAG assumes that the first uplink to be added to
the aggregate will own VFs and have switchdev configured. After
failing-over to the other uplink, representors are still configured to
Tx through the uplink they are set up on, which fails because that
uplink is now down.

On failover, update all PRs on primary uplink to use the currently
active uplink for Tx. Call netif_keep_dst(), as the secondary uplink
might not be in switchdev mode. Also make sure to call
ice_eswitch_set_target_vsi() if uplink is in LAG.

On the Rx path, representors are already working properly, because
default Tx from VFs is set to PF owning the eswitch. After failover the
same PF is receiving traffic from VFs, even though link is down.

Fixes: defd52455a ("ice: do Tx through PF netdev in slow-path")
Reviewed-by: Michal Swiatkowski <michal.swiatkowski@linux.intel.com>
Signed-off-by: Marcin Szycik <marcin.szycik@linux.intel.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Tested-by: Rafal Romanowski <rafal.romanowski@intel.com>
Tested-by: Sujai Buvaneswaran <sujai.buvaneswaran@intel.com>
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
2025-03-05 08:43:05 -08:00
Grzegorz Nitka 23d97f1890 ice: fix memory leak in aRFS after reset
Fix aRFS (accelerated Receive Flow Steering) structures memory leak by
adding a checker to verify if aRFS memory is already allocated while
configuring VSI. aRFS objects are allocated in two cases:
- as part of VSI initialization (at probe), and
- as part of reset handling

However, VSI reconfiguration executed during reset involves memory
allocation one more time, without prior releasing already allocated
resources. This led to the memory leak with the following signature:

[root@os-delivery ~]# cat /sys/kernel/debug/kmemleak
unreferenced object 0xff3c1ca7252e6000 (size 8192):
  comm "kworker/0:0", pid 8, jiffies 4296833052
  hex dump (first 32 bytes):
    00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
    00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
  backtrace (crc 0):
    [<ffffffff991ec485>] __kmalloc_cache_noprof+0x275/0x340
    [<ffffffffc0a6e06a>] ice_init_arfs+0x3a/0xe0 [ice]
    [<ffffffffc09f1027>] ice_vsi_cfg_def+0x607/0x850 [ice]
    [<ffffffffc09f244b>] ice_vsi_setup+0x5b/0x130 [ice]
    [<ffffffffc09c2131>] ice_init+0x1c1/0x460 [ice]
    [<ffffffffc09c64af>] ice_probe+0x2af/0x520 [ice]
    [<ffffffff994fbcd3>] local_pci_probe+0x43/0xa0
    [<ffffffff98f07103>] work_for_cpu_fn+0x13/0x20
    [<ffffffff98f0b6d9>] process_one_work+0x179/0x390
    [<ffffffff98f0c1e9>] worker_thread+0x239/0x340
    [<ffffffff98f14abc>] kthread+0xcc/0x100
    [<ffffffff98e45a6d>] ret_from_fork+0x2d/0x50
    [<ffffffff98e083ba>] ret_from_fork_asm+0x1a/0x30
    ...

Fixes: 28bf26724f ("ice: Implement aRFS")
Reviewed-by: Michal Swiatkowski <michal.swiatkowski@linux.intel.com>
Signed-off-by: Grzegorz Nitka <grzegorz.nitka@intel.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Tested-by: Rinitha S <sx.rinitha@intel.com> (A Contingent worker at Intel)
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
2025-03-05 08:40:41 -08:00
Larysa Zaremba 3be83ee9de ice: do not configure destination override for switchdev
After switchdev is enabled and disabled later, LLDP packets sending stops,
despite working perfectly fine before and during switchdev state.
To reproduce (creating/destroying VF is what triggers the reconfiguration):

devlink dev eswitch set pci/<address> mode switchdev
echo '2' > /sys/class/net/<ifname>/device/sriov_numvfs
echo '0' > /sys/class/net/<ifname>/device/sriov_numvfs

This happens because LLDP relies on the destination override functionality.
It needs to 1) set a flag in the descriptor, 2) set the VSI permission to
make it valid. The permissions are set when the PF VSI is first configured,
but switchdev then enables it for the uplink VSI (which is always the PF)
once more when configured and disables when deconfigured, which leads to
software-generated LLDP packets being blocked.

Do not modify the destination override permissions when configuring
switchdev, as the enabled state is the default configuration that is never
modified.

Fixes: 1a1c40df2e ("ice: set and release switchdev environment")
Reviewed-by: Michal Swiatkowski <michal.swiatkowski@linux.intel.com>
Signed-off-by: Larysa Zaremba <larysa.zaremba@intel.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Tested-by: Sujai Buvaneswaran <sujai.buvaneswaran@intel.com>
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
2025-03-05 08:30:26 -08:00
Gal Pressman e0c032d26d ice: dpll: Remove newline at the end of a netlink error message
Netlink error messages should not have a newline at the end of the
string.

Reviewed-by: Tariq Toukan <tariqt@nvidia.com>
Signed-off-by: Gal Pressman <gal@nvidia.com>
Acked-by: Przemek Kitszel <przemyslaw.kitszel@intel.com>
Reviewed-by: Michal Swiatkowski <michal.swiatkowski@linux.intel.com>
Link: https://patch.msgid.link/20250226093904.6632-6-gal@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-02-27 18:11:38 -08:00
Jakub Kicinski 357660d759 Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Cross-merge networking fixes after downstream PR (net-6.14-rc5).

Conflicts:

drivers/net/ethernet/cadence/macb_main.c
  fa52f15c74 ("net: cadence: macb: Synchronize stats calculations")
  75696dd0fd ("net: cadence: macb: Convert to get_stats64")
https://lore.kernel.org/20250224125848.68ee63e5@canb.auug.org.au

Adjacent changes:

drivers/net/ethernet/intel/ice/ice_sriov.c
  79990cf5e7 ("ice: Fix deinitializing VF in error path")
  a203163274 ("ice: simplify VF MSI-X managing")

net/ipv4/tcp.c
  18912c5206 ("tcp: devmem: don't write truncated dmabuf CMSGs to userspace")
  297d389e9e ("net: prefix devmem specific helpers")

net/mptcp/subflow.c
  8668860b0a ("mptcp: reset when MPTCP opts are dropped after join")
  c3349a22c2 ("mptcp: consolidate subflow cleanup")

Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-02-27 10:20:58 -08:00
Eric Dumazet 674fcb4f4a idpf: fix checksums set in idpf_rx_rsc()
idpf_rx_rsc() uses skb_transport_offset(skb) while the transport header
is not set yet.

This triggers the following warning for CONFIG_DEBUG_NET=y builds.

DEBUG_NET_WARN_ON_ONCE(!skb_transport_header_was_set(skb))

[   69.261620] WARNING: CPU: 7 PID: 0 at ./include/linux/skbuff.h:3020 idpf_vport_splitq_napi_poll (include/linux/skbuff.h:3020) idpf
[   69.261629] Modules linked in: vfat fat dummy bridge intel_uncore_frequency_tpmi intel_uncore_frequency_common intel_vsec_tpmi idpf intel_vsec cdc_ncm cdc_eem cdc_ether usbnet mii xhci_pci xhci_hcd ehci_pci ehci_hcd libeth
[   69.261644] CPU: 7 UID: 0 PID: 0 Comm: swapper/7 Tainted: G S      W          6.14.0-smp-DEV #1697
[   69.261648] Tainted: [S]=CPU_OUT_OF_SPEC, [W]=WARN
[   69.261650] RIP: 0010:idpf_vport_splitq_napi_poll (include/linux/skbuff.h:3020) idpf
[   69.261677] ? __warn (kernel/panic.c:242 kernel/panic.c:748)
[   69.261682] ? idpf_vport_splitq_napi_poll (include/linux/skbuff.h:3020) idpf
[   69.261687] ? report_bug (lib/bug.c:?)
[   69.261690] ? handle_bug (arch/x86/kernel/traps.c:285)
[   69.261694] ? exc_invalid_op (arch/x86/kernel/traps.c:309)
[   69.261697] ? asm_exc_invalid_op (arch/x86/include/asm/idtentry.h:621)
[   69.261700] ? __pfx_idpf_vport_splitq_napi_poll (drivers/net/ethernet/intel/idpf/idpf_txrx.c:4011) idpf
[   69.261704] ? idpf_vport_splitq_napi_poll (include/linux/skbuff.h:3020) idpf
[   69.261708] ? idpf_vport_splitq_napi_poll (drivers/net/ethernet/intel/idpf/idpf_txrx.c:3072) idpf
[   69.261712] __napi_poll (net/core/dev.c:7194)
[   69.261716] net_rx_action (net/core/dev.c:7265)
[   69.261718] ? __qdisc_run (net/sched/sch_generic.c:293)
[   69.261721] ? sched_clock (arch/x86/include/asm/preempt.h:84 arch/x86/kernel/tsc.c:288)
[   69.261726] handle_softirqs (kernel/softirq.c:561)

Fixes: 3a8845af66 ("idpf: add RX splitq napi poll support")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Alan Brady <alan.brady@intel.com>
Cc: Joshua Hay <joshua.a.hay@intel.com>
Cc: Willem de Bruijn <willemb@google.com>
Acked-by: Przemek Kitszel <przemyslaw.kitszel@intel.com>
Link: https://patch.msgid.link/20250226221253.1927782-1-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-02-27 07:31:41 -08:00
Ahmed Zaki deab38f8f0 idpf: use napi's irq affinity
Delete the driver CPU affinity info and use the core's napi config
instead.

Signed-off-by: Ahmed Zaki <ahmed.zaki@intel.com>
Link: https://patch.msgid.link/20250224232228.990783-6-ahmed.zaki@intel.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-02-26 19:51:38 -08:00
Ahmed Zaki 4063af2967 ice: use napi's irq affinity and rmap IRQ notifiers
Delete the driver CPU affinity and aRFS rmap info, use the core's
API instead.

Signed-off-by: Ahmed Zaki <ahmed.zaki@intel.com>
Link: https://patch.msgid.link/20250224232228.990783-5-ahmed.zaki@intel.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-02-26 19:51:37 -08:00
Ahmed Zaki 30b78ba3d4 ice: clear NAPI's IRQ numbers in ice_vsi_clear_napi_queues()
We set the NAPI's IRQ number in ice_vsi_set_napi_queues(). Clear the
NAPI's IRQ in ice_vsi_clear_napi_queues().

Signed-off-by: Ahmed Zaki <ahmed.zaki@intel.com>
Link: https://patch.msgid.link/20250224232228.990783-4-ahmed.zaki@intel.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-02-26 19:51:37 -08:00
Willem de Bruijn e6116fc605 net: skb: free up one bit in tx_flags
The linked series wants to add skb tx completion timestamps.
That needs a bit in skb_shared_info.tx_flags, but all are in use.

A per-skb bit is only needed for features that are configured on a
per packet basis. Per socket features can be read from sk->sk_tsflags.

Per packet tsflags can be set in sendmsg using cmsg, but only those in
SOF_TIMESTAMPING_TX_RECORD_MASK.

Per packet tsflags can also be set without cmsg by sandwiching a
send inbetween two setsockopts:

    val |= SOF_TIMESTAMPING_$FEATURE;
    setsockopt(fd, SOL_SOCKET, SO_TIMESTAMPING, &val, sizeof(val));
    write(fd, buf, sz);
    val &= ~SOF_TIMESTAMPING_$FEATURE;
    setsockopt(fd, SOL_SOCKET, SO_TIMESTAMPING, &val, sizeof(val));

Changing a datapath test from skb_shinfo(skb)->tx_flags to
skb->sk->sk_tsflags can change behavior in that case, as the tx_flags
is written before the second setsockopt updates sk_tsflags.

Therefore, only bits can be reclaimed that cannot be set by cmsg and
are also highly unlikely to be used to target individual packets
otherwise.

Free up the bit currently used for SKBTX_HW_TSTAMP_USE_CYCLES. This
selects between clock and free running counter source for HW TX
timestamps. It is probable that all packets of the same socket will
always use the same source.

Link: https://lore.kernel.org/netdev/cover.1739988644.git.pav@iki.fi/
Signed-off-by: Willem de Bruijn <willemb@google.com>
Reviewed-by: Jason Xing <kerneljasonxing@gmail.com>
Reviewed-by: Gerhard Engleder <gerhard@engleder-embedded.com>
Link: https://patch.msgid.link/20250225023416.2088705-1-willemdebruijn.kernel@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-02-26 19:01:08 -08:00
Piotr Kwapulinski b1e44b4aec ixgbe: fix media cage present detection for E610 device
The commit 23c0e5a16b ("ixgbe: Add link management support for E610
device") introduced incorrect checking of media cage presence for E610
device. Fix it.

Fixes: 23c0e5a16b ("ixgbe: Add link management support for E610 device")
Reported-by: Dan Carpenter <dan.carpenter@linaro.org>
Closes: https://lore.kernel.org/all/e7d73b32-f12a-49d1-8b60-1ef83359ec13@stanley.mountain/
Reviewed-by: Michal Swiatkowski <michal.swiatkowski@linux.intel.com>
Reviewed-by: Przemek Kitszel <przemyslaw.kitszel@intel.com>
Signed-off-by: Piotr Kwapulinski <piotr.kwapulinski@intel.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Tested-by: Bharath R <bharath.r@intel.com>
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
Link: https://patch.msgid.link/20250224190647.3601930-6-anthony.l.nguyen@intel.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-02-25 19:09:40 -08:00
Jacob Keller c6124f6fd3 iavf: fix circular lock dependency with netdev_lock
We have recently seen reports of lockdep circular lock dependency warnings
when loading the iAVF driver:

[ 1504.790308] ======================================================
[ 1504.790309] WARNING: possible circular locking dependency detected
[ 1504.790310] 6.13.0 #net_next_rt.c2933b2befe2.el9 Not tainted
[ 1504.790311] ------------------------------------------------------
[ 1504.790312] kworker/u128:0/13566 is trying to acquire lock:
[ 1504.790313] ffff97d0e4738f18 (&dev->lock){+.+.}-{4:4}, at: register_netdevice+0x52c/0x710
[ 1504.790320]
[ 1504.790320] but task is already holding lock:
[ 1504.790321] ffff97d0e47392e8 (&adapter->crit_lock){+.+.}-{4:4}, at: iavf_finish_config+0x37/0x240 [iavf]
[ 1504.790330]
[ 1504.790330] which lock already depends on the new lock.
[ 1504.790330]
[ 1504.790330]
[ 1504.790330] the existing dependency chain (in reverse order) is:
[ 1504.790331]
[ 1504.790331] -> #1 (&adapter->crit_lock){+.+.}-{4:4}:
[ 1504.790333]        __lock_acquire+0x52d/0xbb0
[ 1504.790337]        lock_acquire+0xd9/0x330
[ 1504.790338]        mutex_lock_nested+0x4b/0xb0
[ 1504.790341]        iavf_finish_config+0x37/0x240 [iavf]
[ 1504.790347]        process_one_work+0x248/0x6d0
[ 1504.790350]        worker_thread+0x18d/0x330
[ 1504.790352]        kthread+0x10e/0x250
[ 1504.790354]        ret_from_fork+0x30/0x50
[ 1504.790357]        ret_from_fork_asm+0x1a/0x30
[ 1504.790361]
[ 1504.790361] -> #0 (&dev->lock){+.+.}-{4:4}:
[ 1504.790364]        check_prev_add+0xf1/0xce0
[ 1504.790366]        validate_chain+0x46a/0x570
[ 1504.790368]        __lock_acquire+0x52d/0xbb0
[ 1504.790370]        lock_acquire+0xd9/0x330
[ 1504.790371]        mutex_lock_nested+0x4b/0xb0
[ 1504.790372]        register_netdevice+0x52c/0x710
[ 1504.790374]        iavf_finish_config+0xfa/0x240 [iavf]
[ 1504.790379]        process_one_work+0x248/0x6d0
[ 1504.790381]        worker_thread+0x18d/0x330
[ 1504.790383]        kthread+0x10e/0x250
[ 1504.790385]        ret_from_fork+0x30/0x50
[ 1504.790387]        ret_from_fork_asm+0x1a/0x30
[ 1504.790389]
[ 1504.790389] other info that might help us debug this:
[ 1504.790389]
[ 1504.790389]  Possible unsafe locking scenario:
[ 1504.790389]
[ 1504.790390]        CPU0                    CPU1
[ 1504.790391]        ----                    ----
[ 1504.790391]   lock(&adapter->crit_lock);
[ 1504.790393]                                lock(&dev->lock);
[ 1504.790394]                                lock(&adapter->crit_lock);
[ 1504.790395]   lock(&dev->lock);
[ 1504.790397]
[ 1504.790397]  *** DEADLOCK ***

This appears to be caused by the change in commit 5fda3f3534 ("net: make
netdev_lock() protect netdev->reg_state"), which added a netdev_lock() in
register_netdevice.

The iAVF driver calls register_netdevice() from iavf_finish_config(), as a
final stage of its state machine post-probe. It currently takes the RTNL
lock, then the netdev lock, and then the device critical lock. This pattern
is used throughout the driver. Thus there is a strong dependency that the
crit_lock should not be acquired before the net device lock. The change to
register_netdevice creates an ABBA lock order violation because the iAVF
driver is holding the crit_lock while calling register_netdevice, which
then takes the netdev_lock.

It seems likely that future refactors could result in netdev APIs which
hold the netdev_lock while calling into the driver. This means that we
should not re-order the locks so that netdev_lock is acquired after the
device private crit_lock.

Instead, notice that we already release the netdev_lock prior to calling
the register_netdevice. This flow only happens during the early driver
initialization as we transition through the __IAVF_STARTUP,
__IAVF_INIT_VERSION_CHECK, __IAVF_INIT_GET_RESOURCES, etc.

Analyzing the places where we take crit_lock in the driver there are two
sources:

a) several of the work queue tasks including adminq_task, watchdog_task,
reset_task, and the finish_config task.

b) various callbacks which ultimately stem back to .ndo operations or
ethtool operations.

The latter cannot be triggered until after the netdevice registration is
completed successfully.

The iAVF driver uses alloc_ordered_workqueue, which is an unbound workqueue
that has a max limit of 1, and thus guarantees that only a single work item
on the queue is executing at any given time, so none of the other work
threads could be executing due to the ordered workqueue guarantees.

The iavf_finish_config() function also does not do anything else after
register_netdevice, unless it fails. It seems unlikely that the driver
private crit_lock is protecting anything that register_netdevice() itself
touches.

Thus, to fix this ABBA lock violation, lets simply release the
adapter->crit_lock as well as netdev_lock prior to calling
register_netdevice(). We do still keep holding the RTNL lock as required by
the function. If we do fail to register the netdevice, then we re-acquire
the adapter critical lock to finish the transition back to
__IAVF_INIT_CONFIG_ADAPTER.

This ensures every call where both netdev_lock and the adapter->crit_lock
are acquired under the same ordering.

Fixes: afc664987a ("eth: iavf: extend the netdev_lock usage")
Signed-off-by: Jacob Keller <jacob.e.keller@intel.com>
Tested-by: Przemek Kitszel <przemyslaw.kitszel@intel.com>
Reviewed-by: Przemek Kitszel <przemyslaw.kitszel@intel.com>
Reviewed-by: Jakub Kicinski <kuba@kernel.org>
Tested-by: Rafal Romanowski <rafal.romanowski@intel.com>
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
Link: https://patch.msgid.link/20250224190647.3601930-5-anthony.l.nguyen@intel.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-02-25 19:09:40 -08:00
Marcin Szycik 5c07be96d8 ice: Avoid setting default Rx VSI twice in switchdev setup
As part of switchdev environment setup, uplink VSI is configured as
default for both Tx and Rx. Default Rx VSI is also used by promiscuous
mode. If promisc mode is enabled and an attempt to enter switchdev mode
is made, the setup will fail because Rx VSI is already configured as
default (rule exists).

Reproducer:
  devlink dev eswitch set $PF1_PCI mode switchdev
  ip l s $PF1 up
  ip l s $PF1 promisc on
  echo 1 > /sys/class/net/$PF1/device/sriov_numvfs

In switchdev setup, use ice_set_dflt_vsi() instead of plain
ice_cfg_dflt_vsi(), which avoids repeating setting default VSI for Rx if
it's already configured.

Fixes: 50d62022f4 ("ice: default Tx rule instead of to queue")
Reported-by: Sujai Buvaneswaran <sujai.buvaneswaran@intel.com>
Closes: https://lore.kernel.org/intel-wired-lan/PH0PR11MB50138B635F2E5CEB7075325D961F2@PH0PR11MB5013.namprd11.prod.outlook.com
Reviewed-by: Martyna Szapar-Mudlaw <martyna.szapar-mudlaw@linux.intel.com>
Signed-off-by: Marcin Szycik <marcin.szycik@linux.intel.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Tested-by: Sujai Buvaneswaran <sujai.buvaneswaran@intel.com>
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
Link: https://patch.msgid.link/20250224190647.3601930-3-anthony.l.nguyen@intel.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-02-25 19:09:36 -08:00
Marcin Szycik 79990cf5e7 ice: Fix deinitializing VF in error path
If ice_ena_vfs() fails after calling ice_create_vf_entries(), it frees
all VFs without removing them from snapshot PF-VF mailbox list, leading
to list corruption.

Reproducer:
  devlink dev eswitch set $PF1_PCI mode switchdev
  ip l s $PF1 up
  ip l s $PF1 promisc on
  sleep 1
  echo 1 > /sys/class/net/$PF1/device/sriov_numvfs
  sleep 1
  echo 1 > /sys/class/net/$PF1/device/sriov_numvfs

Trace (minimized):
  list_add corruption. next->prev should be prev (ffff8882e241c6f0), but was 0000000000000000. (next=ffff888455da1330).
  kernel BUG at lib/list_debug.c:29!
  RIP: 0010:__list_add_valid_or_report+0xa6/0x100
   ice_mbx_init_vf_info+0xa7/0x180 [ice]
   ice_initialize_vf_entry+0x1fa/0x250 [ice]
   ice_sriov_configure+0x8d7/0x1520 [ice]
   ? __percpu_ref_switch_mode+0x1b1/0x5d0
   ? __pfx_ice_sriov_configure+0x10/0x10 [ice]

Sometimes a KASAN report can be seen instead with a similar stack trace:
  BUG: KASAN: use-after-free in __list_add_valid_or_report+0xf1/0x100

VFs are added to this list in ice_mbx_init_vf_info(), but only removed
in ice_free_vfs(). Move the removing to ice_free_vf_entries(), which is
also being called in other places where VFs are being removed (including
ice_free_vfs() itself).

Fixes: 8cd8a6b17d ("ice: move VF overflow message count into struct ice_mbx_vf_info")
Reported-by: Sujai Buvaneswaran <sujai.buvaneswaran@intel.com>
Closes: https://lore.kernel.org/intel-wired-lan/PH0PR11MB50138B635F2E5CEB7075325D961F2@PH0PR11MB5013.namprd11.prod.outlook.com
Reviewed-by: Martyna Szapar-Mudlaw <martyna.szapar-mudlaw@linux.intel.com>
Signed-off-by: Marcin Szycik <marcin.szycik@linux.intel.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Tested-by: Sujai Buvaneswaran <sujai.buvaneswaran@intel.com>
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
Link: https://patch.msgid.link/20250224190647.3601930-2-anthony.l.nguyen@intel.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-02-25 19:09:36 -08:00
Gal Pressman ecdff89338 ethtool: Symmetric OR-XOR RSS hash
Add an additional type of symmetric RSS hash type: OR-XOR.
The "Symmetric-OR-XOR" algorithm transforms the input as follows:

(SRC_IP | DST_IP, SRC_IP ^ DST_IP, SRC_PORT | DST_PORT, SRC_PORT ^ DST_PORT)

Change 'cap_rss_sym_xor_supported' to 'supported_input_xfrm', a bitmap
of supported RXH_XFRM_* types.

Reviewed-by: Cosmin Ratiu <cratiu@nvidia.com>
Reviewed-by: Tariq Toukan <tariqt@nvidia.com>
Signed-off-by: Gal Pressman <gal@nvidia.com>
Reviewed-by: Edward Cree <ecree.xilinx@gmail.com>
Link: https://patch.msgid.link/20250224174416.499070-2-gal@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-02-25 18:31:04 -08:00
Jakub Kicinski e87700965a bpf-next-for-netdev
-----BEGIN PGP SIGNATURE-----
 
 iHUEABYKAB0WIQQ6NaUOruQGUkvPdG4raS+Z+3y5EwUCZ7ffOQAKCRAraS+Z+3y5
 EzVHAP9h/QkeYoOZW9gul08I8vFiZsFe/lbOSLJWxeVfxb9JhgD/cMqby3qAxQK6
 lsdNQ9jYG2232Wym89ag7fvTBK15Wg4=
 =gkN2
 -----END PGP SIGNATURE-----

Merge tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next

Martin KaFai Lau says:

====================
pull-request: bpf-next 2025-02-20

We've added 19 non-merge commits during the last 8 day(s) which contain
a total of 35 files changed, 1126 insertions(+), 53 deletions(-).

The main changes are:

1) Add TCP_RTO_MAX_MS support to bpf_set/getsockopt, from Jason Xing

2) Add network TX timestamping support to BPF sock_ops, from Jason Xing

3) Add TX metadata Launch Time support, from Song Yoong Siang

* tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next:
  igc: Add launch time support to XDP ZC
  igc: Refactor empty frame insertion for launch time support
  net: stmmac: Add launch time support to XDP ZC
  selftests/bpf: Add launch time request to xdp_hw_metadata
  xsk: Add launch time hardware offload support to XDP Tx metadata
  selftests/bpf: Add simple bpf tests in the tx path for timestamping feature
  bpf: Support selective sampling for bpf timestamping
  bpf: Add BPF_SOCK_OPS_TSTAMP_SENDMSG_CB callback
  bpf: Add BPF_SOCK_OPS_TSTAMP_ACK_CB callback
  bpf: Add BPF_SOCK_OPS_TSTAMP_SND_HW_CB callback
  bpf: Add BPF_SOCK_OPS_TSTAMP_SND_SW_CB callback
  bpf: Add BPF_SOCK_OPS_TSTAMP_SCHED_CB callback
  net-timestamp: Prepare for isolating two modes of SO_TIMESTAMPING
  bpf: Disable unsafe helpers in TX timestamping callbacks
  bpf: Prevent unsafe access to the sock fields in the BPF timestamping callback
  bpf: Prepare the sock_ops ctx and call bpf prog for TX timestamping
  bpf: Add networking timestamping support to bpf_get/setsockopt()
  selftests/bpf: Add rto max for bpf_setsockopt test
  bpf: Support TCP_RTO_MAX_MS for bpf_setsockopt
====================

Link: https://patch.msgid.link/20250221022104.386462-1-martin.lau@linux.dev
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-02-21 15:59:47 -08:00
Leon Romanovsky cc18f482e8 xfrm: provide common xdo_dev_offload_ok callback implementation
Almost all drivers except bond and nsim had same check if device
can perform XFRM offload on that specific packet. The check was that
packet doesn't have IPv4 options and IPv6 extensions.

In NIC drivers, the IPv4 HELEN comparison was slightly different, but
the intent was to check for the same conditions. So let's chose more
strict variant as a common base.

Reviewed-by: Zhu Yanjun <yanjun.zhu@linux.dev>
Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
2025-02-21 08:08:15 +01:00
Song Yoong Siang d7c3a7ff75 igc: Add launch time support to XDP ZC
Enable Launch Time Control (LTC) support for XDP zero copy via XDP Tx
metadata framework.

This patch has been tested with tools/testing/selftests/bpf/xdp_hw_metadata
on Intel I225-LM Ethernet controller. Below are the test steps and result.

Test 1: Send a single packet with the launch time set to 1 s in the future.

Test steps:
1. On the DUT, start the xdp_hw_metadata selftest application:
   $ sudo ./xdp_hw_metadata enp2s0 -l 1000000000 -L 1

2. On the Link Partner, send a UDP packet with VLAN priority 1 to port 9091
   of the DUT.

Result:
When the launch time is set to 1 s in the future, the delta between the
launch time and the transmit hardware timestamp is 0.016 us, as shown in
printout of the xdp_hw_metadata application below.
  0x562ff5dc8880: rx_desc[4]->addr=84110 addr=84110 comp_addr=84110 EoP
  rx_hash: 0xE343384 with RSS type:0x1
  HW RX-time:   1734578015467548904 (sec:1734578015.4675)
                delta to User RX-time sec:0.0002 (183.103 usec)
  XDP RX-time:   1734578015467651698 (sec:1734578015.4677)
                 delta to User RX-time sec:0.0001 (80.309 usec)
  No rx_vlan_tci or rx_vlan_proto, err=-95
  0x562ff5dc8880: ping-pong with csum=561c (want c7dd)
                  csum_start=34 csum_offset=6
  HW RX-time:   1734578015467548904 (sec:1734578015.4675)
                delta to HW Launch-time sec:1.0000 (1000000.000 usec)
  0x562ff5dc8880: complete tx idx=4 addr=4018
  HW Launch-time:   1734578016467548904 (sec:1734578016.4675)
                    delta to HW TX-complete-time sec:0.0000 (0.016 usec)
  HW TX-complete-time:   1734578016467548920 (sec:1734578016.4675)
                         delta to User TX-complete-time sec:0.0000
                         (32.546 usec)
  XDP RX-time:   1734578015467651698 (sec:1734578015.4677)
                 delta to User TX-complete-time sec:0.9999
                 (999929.768 usec)
  HW RX-time:   1734578015467548904 (sec:1734578015.4675)
                delta to HW TX-complete-time sec:1.0000 (1000000.016 usec)
  0x562ff5dc8880: complete rx idx=132 addr=84110

Test 2: Send 1000 packets with a 10 ms interval and the launch time set to
        500 us in the future.

Test steps:
1. On the DUT, start the xdp_hw_metadata selftest application:
   $ sudo chrt -f 99 ./xdp_hw_metadata enp2s0 -l 500000 -L 1 > \
     /dev/shm/result.log

2. On the Link Partner, send 1000 UDP packets with a 10 ms interval and
   VLAN priority 1 to port 9091 of the DUT.

Result:
When the launch time is set to 500 us in the future, the average delta
between the launch time and the transmit hardware timestamp is 0.016 us,
as shown in the analysis of /dev/shm/result.log below. The XDP launch time
works correctly in sending 1000 packets continuously.
  Min delta: 0.005 us
  Avr delta: 0.016 us
  Max delta: 0.031 us
  Total packets forwarded: 1000

Signed-off-by: Song Yoong Siang <yoong.siang.song@intel.com>
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
Reviewed-by: Faizal Rahim <faizal.abdul.rahim@linux.intel.com>
Reviewed-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com>
Link: https://patch.msgid.link/20250216093430.957880-6-yoong.siang.song@intel.com
2025-02-20 15:13:46 -08:00
Song Yoong Siang f9b53bb139 igc: Refactor empty frame insertion for launch time support
Refactor the code for inserting an empty frame into a new function
igc_insert_empty_frame(). This change extracts the logic for inserting
an empty packet from igc_xmit_frame_ring() into a separate function,
allowing it to be reused in future implementations, such as the XDP
zero copy transmit function.

Remove the igc_desc_unused() checking in igc_init_tx_empty_descriptor()
because the number of descriptors needed is guaranteed.

Ensure that skb allocation and DMA mapping work for the empty frame,
before proceeding to fill in igc_tx_buffer info, context descriptor,
and data descriptor.

Rate limit the error messages for skb allocation and DMA mapping failures.

Update the comment to indicate that the 2 descriptors needed by the empty
frame are already taken into consideration in igc_xmit_frame_ring().

Handle the case where the insertion of an empty frame fails and explain
the reason behind this handling.

Signed-off-by: Song Yoong Siang <yoong.siang.song@intel.com>
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
Reviewed-by: Faizal Rahim <faizal.abdul.rahim@linux.intel.com>
Reviewed-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com>
Link: https://patch.msgid.link/20250216093430.957880-5-yoong.siang.song@intel.com
2025-02-20 15:13:46 -08:00
Nam Cao e26ad10db8 igc: Switch to use hrtimer_setup()
hrtimer_setup() takes the callback function pointer as argument and
initializes the timer completely.

Replace hrtimer_init() and the open coded initialization of
hrtimer::function with the new setup mechanism.

Patch was created by using Coccinelle.

Signed-off-by: Nam Cao <namcao@linutronix.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/all/baeaabb84aef1803dfae385f6e9614d6fc547667.1738746872.git.namcao@linutronix.de
2025-02-18 10:35:47 +01:00
Jakub Kicinski 0f375d90c4 Merge branch '100GbE' of git://git.kernel.org/pub/scm/linux/kernel/git/tnguy/next-queue
Tony Nguyen says:

====================
ice, iavf: Add support for Rx timestamping

Mateusz Polchlopek says:

Initially, during VF creation it registers the PTP clock in
the system and negotiates with PF it's capabilities. In the
meantime the PF enables the Flexible Descriptor for VF.
Only this type of descriptor allows to receive Rx timestamps.

Enabling virtual clock would be possible, though it would probably
perform poorly due to the lack of direct time access.

Enable timestamping should be done using userspace tools, e.g.
hwstamp_ctl -i $VF -r 14

In order to report the timestamps to userspace, the VF extends
timestamp to 40b.

To support this feature the flexible descriptors and PTP part
in iavf driver have been introduced.

* '100GbE' of git://git.kernel.org/pub/scm/linux/kernel/git/tnguy/next-queue:
  iavf: add support for Rx timestamps to hotpath
  iavf: handle set and get timestamps ops
  iavf: Implement checking DD desc field
  iavf: refactor iavf_clean_rx_irq to support legacy and flex descriptors
  iavf: define Rx descriptors as qwords
  libeth: move idpf_rx_csum_decoded and idpf_rx_extracted
  iavf: periodically cache PHC time
  iavf: add support for indirect access to PHC time
  iavf: add initial framework for registering PTP clock
  iavf: negotiate PTP capabilities
  iavf: add support for negotiating flexible RXDID format
  virtchnl: add enumeration for the rxdid format
  ice: support Rx timestamp on flex descriptor
  virtchnl: add support for enabling PTP on iAVF
====================

Link: https://patch.msgid.link/20250214192739.1175740-1-anthony.l.nguyen@intel.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-02-17 17:09:44 -08:00
Dan Carpenter c2ddb619fa ice: Fix signedness bug in ice_init_interrupt_scheme()
If pci_alloc_irq_vectors() can't allocate the minimum number of vectors
then it returns -ENOSPC so there is no need to check for that in the
caller.  In fact, because pf->msix.min is an unsigned int, it means that
any negative error codes are type promoted to high positive values and
treated as success.  So here, the "return -ENOMEM;" is unreachable code.
Check for negatives instead.

Now that we're only dealing with error codes, it's easier to propagate
the error code from pci_alloc_irq_vectors() instead of hardcoding
-ENOMEM.

Fixes: 79d97b8cf9 ("ice: remove splitting MSI-X between features")
Signed-off-by: Dan Carpenter <dan.carpenter@linaro.org>
Reviewed-by: Michal Swiatkowski <michal.swiatkowski@linux.intel.com>
Link: https://patch.msgid.link/b16e4f01-4c85-46e2-b602-fce529293559@stanley.mountain
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-02-14 17:18:00 -08:00
Jacob Keller 48ccdcd87e iavf: add support for Rx timestamps to hotpath
Add support for receive timestamps to the Rx hotpath. This support only
works when using the flexible descriptor format, so make sure that we
request this format by default if we have receive timestamp support
available in the PTP capabilities.

In order to report the timestamps to userspace, we need to perform
timestamp extension. The Rx descriptor does actually contain the "40
bit" timestamp. However, upper 32 bits which contain nanoseconds are
conveniently stored separately in the descriptor. We could extract the
32bits and lower 8 bits, then perform a bitwise OR to calculate the
40bit value. This makes no sense, because the timestamp extension
algorithm would simply discard the lower 8 bits anyways.

Thus, implement timestamp extension as iavf_ptp_extend_32b_timestamp(),
and extract and forward only the 32bits of nominal nanoseconds.

Signed-off-by: Jacob Keller <jacob.e.keller@intel.com>
Reviewed-by: Rahul Rameshbabu <rrameshbabu@nvidia.com>
Reviewed-by: Sunil Goutham <sgoutham@marvell.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Tested-by: Rafal Romanowski <rafal.romanowski@intel.com>
Signed-off-by: Mateusz Polchlopek <mateusz.polchlopek@intel.com>
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
2025-02-14 10:58:08 -08:00
Jacob Keller 51534239ef iavf: handle set and get timestamps ops
Add handlers for the .ndo_hwtstamp_get and .ndo_hwtstamp_set ops which
allow userspace to request timestamp enablement for the device. This
support allows standard Linux applications to request the timestamping
desired.

As with other devices that support timestamping all packets, the driver
will upgrade any request for timestamping of a specific type of packet
to HWTSTAMP_FILTER_ALL.

The current configuration is stored, so that it can be retrieved by
calling .ndo_hwtstamp_get

The Tx timestamps are not implemented yet so calling set ops for
Tx path will end with EOPNOTSUPP error code.

Signed-off-by: Jacob Keller <jacob.e.keller@intel.com>
Reviewed-by: Rahul Rameshbabu <rrameshbabu@nvidia.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Reviewed-by: Alexander Lobakin <aleksander.lobakin@intel.com>
Tested-by: Rafal Romanowski <rafal.romanowski@intel.com>
Co-developed-by: Mateusz Polchlopek <mateusz.polchlopek@intel.com>
Signed-off-by: Mateusz Polchlopek <mateusz.polchlopek@intel.com>
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
2025-02-14 10:58:08 -08:00
Mateusz Polchlopek 8447357e7b iavf: Implement checking DD desc field
Rx timestamping introduced in PF driver caused the need of refactoring
the VF driver mechanism to check packet fields.

The function to check errors in descriptor has been removed and from
now only previously set struct fields are being checked. The field DD
(descriptor done) needs to be checked at the very beginning, before
extracting other fields.

Reviewed-by: Rahul Rameshbabu <rrameshbabu@nvidia.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Tested-by: Rafal Romanowski <rafal.romanowski@intel.com>
Signed-off-by: Mateusz Polchlopek <mateusz.polchlopek@intel.com>
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
2025-02-14 10:58:08 -08:00
Jacob Keller 2dc8e7c36d iavf: refactor iavf_clean_rx_irq to support legacy and flex descriptors
Using VIRTCHNL_VF_OFFLOAD_FLEX_DESC, the iAVF driver is capable of
negotiating to enable the advanced flexible descriptor layout. Add the
flexible NIC layout (RXDID=2) as a member of the Rx descriptor union.

Also add bit position definitions for the status and error indications
that are needed.

The iavf_clean_rx_irq function needs to extract a few fields from the Rx
descriptor, including the size, rx_ptype, and vlan_tag.
Move the extraction to a separate function that decodes the fields into
a structure. This will reduce the burden for handling multiple
descriptor types by keeping the relevant extraction logic in one place.

To support handling an additional descriptor format with minimal code
duplication, refactor Rx checksum handling so that the general logic
is separated from the bit calculations. Introduce an iavf_rx_desc_decoded
structure which holds the relevant bits decoded from the Rx descriptor.
This will enable implementing flexible descriptor handling without
duplicating the general logic twice.

Introduce an iavf_extract_flex_rx_fields, iavf_flex_rx_hash, and
iavf_flex_rx_csum functions which operate on the flexible NIC descriptor
format instead of the legacy 32 byte format. Based on the negotiated
RXDID, select the correct function for processing the Rx descriptors.

With this change, the Rx hot path should be functional when using either
the default legacy 32byte format or when we switch to the flexible NIC
layout.

Modify the Rx hot path to add support for the flexible descriptor
format and add request enabling Rx timestamps for all queues.

As in ice, make sure we bump the checksum level if the hardware detected
a packet type which could have an outer checksum. This is important
because hardware only verifies the inner checksum.

Signed-off-by: Jacob Keller <jacob.e.keller@intel.com>
Tested-by: Rafal Romanowski <rafal.romanowski@intel.com>
Co-developed-by: Mateusz Polchlopek <mateusz.polchlopek@intel.com>
Signed-off-by: Mateusz Polchlopek <mateusz.polchlopek@intel.com>
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
2025-02-14 10:58:08 -08:00
Mateusz Polchlopek e9f476d7b3 iavf: define Rx descriptors as qwords
The union iavf_32byte_rx_desc consists of two unnamed structs defined
inside. One of them represents legacy 32 byte descriptor and second the
16 byte descriptor (extended to 32 byte). Each of them consists of
bunch of unions, structs and __le fields that represent specific fields
in descriptor.

This commit changes the representation of iavf_32byte_rx_desc union
to store four __le64 fields (qw0, qw1, qw2, qw3) that represent
quad-words. Those quad-words will be then accessed by calling
leXY_get_bits macros in upcoming commits.

Suggested-by: Alexander Lobakin <aleksander.lobakin@intel.com>
Tested-by: Rafal Romanowski <rafal.romanowski@intel.com>
Signed-off-by: Mateusz Polchlopek <mateusz.polchlopek@intel.com>
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
2025-02-14 10:58:08 -08:00
Mateusz Polchlopek ce5cf4af7c libeth: move idpf_rx_csum_decoded and idpf_rx_extracted
Structs idpf_rx_csum_decoded and idpf_rx_extracted are used both in
idpf and iavf Intel drivers. Change the prefix from idpf_* to libeth_*
and move mentioned structs to libeth's rx.h header file.

Adjust usage in idpf driver.

Suggested-by: Alexander Lobakin <aleksander.lobakin@intel.com>
Tested-by: Rafal Romanowski <rafal.romanowski@intel.com>
Signed-off-by: Mateusz Polchlopek <mateusz.polchlopek@intel.com>
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
2025-02-14 10:58:08 -08:00
Jacob Keller 7c01dbfc8a iavf: periodically cache PHC time
The Rx timestamps reported by hardware may only have 32 bits of storage
for nanosecond time. These timestamps cannot be directly reported to the
Linux stack, as it expects 64bits of time.

To handle this, the timestamps must be extended using an algorithm that
calculates the corrected 64bit timestamp by comparison between the PHC
time and the timestamp. This algorithm requires the PHC time to be
captured within ~2 seconds of when the timestamp was captured.

Instead of trying to read the PHC time in the Rx hotpath, the algorithm
relies on a cached value that is periodically updated.

Keep this cached time up to date by using the PTP .do_aux_work kthread
function.

The iavf_ptp_do_aux_work will reschedule itself about twice a second,
and will check whether or not the cached PTP time needs to be updated.
If so, it issues a VIRTCHNL_OP_1588_PTP_GET_TIME to request the time
from the PF. The jitter and latency involved with this command aren't
important, because the cached time just needs to be kept up to date
within about ~2 seconds.

Signed-off-by: Jacob Keller <jacob.e.keller@intel.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Tested-by: Rafal Romanowski <rafal.romanowski@intel.com>
Co-developed-by: Mateusz Polchlopek <mateusz.polchlopek@intel.com>
Signed-off-by: Mateusz Polchlopek <mateusz.polchlopek@intel.com>
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
2025-02-14 10:58:07 -08:00
Jacob Keller 52e3beac76 iavf: add support for indirect access to PHC time
Implement support for reading the PHC time indirectly via the
VIRTCHNL_OP_1588_PTP_GET_TIME operation.

Based on some simple tests with ftrace, the latency of the indirect
clock access appears to be about ~110 microseconds. This is due to the
cost of preparing a message to send over the virtchnl queue.

This is expected, due to the increased jitter caused by sending messages
over virtchnl. It is not easy to control the precise time that the
message is sent by the VF, or the time that the message is responded to
by the PF, or the time that the message sent from the PF is received by
the VF.

For sending the request, note that many PTP related operations will
require sending of VIRTCHNL messages. Instead of adding a separate AQ
flag and storage for each operation, setup a simple queue mechanism for
queuing up virtchnl messages.

Each message will be converted to a iavf_ptp_aq_cmd structure which ends
with a flexible array member. A single AQ flag is added for processing
messages from this queue. In principle this could be extended to handle
arbitrary virtchnl messages. For now it is kept to PTP-specific as the
need is primarily for handling PTP-related commands.

Use this to implement .gettimex64 using the indirect method via the
virtchnl command. The response from the PF is processed and stored into
the cached_phc_time. A wait queue is used to allow the PTP clock gettime
request to sleep until the message is sent from the PF.

Signed-off-by: Jacob Keller <jacob.e.keller@intel.com>
Reviewed-by: Rahul Rameshbabu <rrameshbabu@nvidia.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Tested-by: Rafal Romanowski <rafal.romanowski@intel.com>
Signed-off-by: Mateusz Polchlopek <mateusz.polchlopek@intel.com>
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
2025-02-14 10:58:07 -08:00
Jacob Keller d734223b2f iavf: add initial framework for registering PTP clock
Add the iavf_ptp.c file and fill it in with a skeleton framework to
allow registering the PTP clock device.
Add implementation of helper functions to check if a PTP capability
is supported and handle change in PTP capabilities.
Enabling virtual clock would be possible, though it would probably
perform poorly due to the lack of direct time access.

Signed-off-by: Jacob Keller <jacob.e.keller@intel.com>
Reviewed-by: Sai Krishna <saikrishnag@marvell.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Co-developed-by: Ahmed Zaki <ahmed.zaki@intel.com>
Signed-off-by: Ahmed Zaki <ahmed.zaki@intel.com>
Tested-by: Rafal Romanowski <rafal.romanowski@intel.com>
Co-developed-by: Mateusz Polchlopek <mateusz.polchlopek@intel.com>
Signed-off-by: Mateusz Polchlopek <mateusz.polchlopek@intel.com>
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
2025-02-14 10:58:07 -08:00
Jacob Keller 3247d65ad9 iavf: negotiate PTP capabilities
Add a new extended capabilities negotiation to exchange information from
the PF about what PTP capabilities are supported by this VF. This
requires sending a VIRTCHNL_OP_1588_PTP_GET_CAPS message, and waiting
for the response from the PF. Handle this early on during the VF
initialization.

Signed-off-by: Jacob Keller <jacob.e.keller@intel.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Tested-by: Rafal Romanowski <rafal.romanowski@intel.com>
Co-developed-by: Mateusz Polchlopek <mateusz.polchlopek@intel.com>
Signed-off-by: Mateusz Polchlopek <mateusz.polchlopek@intel.com>
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
2025-02-14 10:58:07 -08:00
Jacob Keller 2a86e210f1 iavf: add support for negotiating flexible RXDID format
Enable support for VIRTCHNL_VF_OFFLOAD_RX_FLEX_DESC, to enable the VF
driver the ability to determine what Rx descriptor formats are
available. This requires sending an additional message during
initialization and reset, the VIRTCHNL_OP_GET_SUPPORTED_RXDIDS. This
operation requests the supported Rx descriptor IDs available from the
PF.

This is treated the same way that VLAN V2 capabilities are handled. Add
a new set of extended capability flags, used to process send and receipt
of the VIRTCHNL_OP_GET_SUPPORTED_RXDIDS message.

This ensures we finish negotiating for the supported descriptor formats
prior to beginning configuration of receive queues.

This change stores the supported format bitmap into the iavf_adapter
structure. Additionally, if VIRTCHNL_VF_OFFLOAD_RX_FLEX_DESC is enabled
by the PF, we need to make sure that the Rx queue configuration
specifies the format.

Signed-off-by: Jacob Keller <jacob.e.keller@intel.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Tested-by: Rafal Romanowski <rafal.romanowski@intel.com>
Co-developed-by: Mateusz Polchlopek <mateusz.polchlopek@intel.com>
Signed-off-by: Mateusz Polchlopek <mateusz.polchlopek@intel.com>
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
2025-02-14 10:58:07 -08:00
Simei Su 7c1178a9df ice: support Rx timestamp on flex descriptor
To support Rx timestamp offload, VIRTCHNL_OP_1588_PTP_CAPS is sent by
the VF to request PTP capability and responded by the PF what capability
is enabled for that VF.

Hardware captures timestamps which contain only 32 bits of nominal
nanoseconds, as opposed to the 64bit timestamps that the stack expects.
To convert 32b to 64b, we need a current PHC time.
VIRTCHNL_OP_1588_PTP_GET_TIME is sent by the VF and responded by the
PF with the current PHC time.

Signed-off-by: Simei Su <simei.su@intel.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Tested-by: Rafal Romanowski <rafal.romanowski@intel.com>
Co-developed-by: Mateusz Polchlopek <mateusz.polchlopek@intel.com>
Signed-off-by: Mateusz Polchlopek <mateusz.polchlopek@intel.com>
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
2025-02-14 10:57:45 -08:00
Jakub Kicinski 7a7e019713 Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Cross-merge networking fixes after downstream PR (net-6.14-rc3).

No conflicts or adjacent changes.

Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-02-13 12:43:30 -08:00
Jakub Kicinski 7229fa6b30 Merge branch '200GbE' of git://git.kernel.org/pub/scm/linux/kernel/git/tnguy/net-queue
Tony Nguyen says:

====================
Intel Wired LAN Driver Updates 2025-02-11 (idpf, ixgbe, igc)

For idpf:

Sridhar fixes a couple issues in handling of RSC packets.

Josh adds a call to set_real_num_queues() to keep queue count in sync.

For ixgbe:

Piotr removes missed IS_ERR() removal when ERR_PTR usage was removed.

For igc:

Zdenek Bouska fixes reporting of Rx timestamp with AF_XDP.

Siang sets buffer type on empty frame to ensure proper handling.

* '200GbE' of git://git.kernel.org/pub/scm/linux/kernel/git/tnguy/net-queue:
  igc: Set buffer type for empty frames in igc_init_empty_frame
  igc: Fix HW RX timestamp when passed by ZC XDP
  ixgbe: Fix possible skb NULL pointer dereference
  idpf: call set_real_num_queues in idpf_open
  idpf: record rx queue in skb for RSC packets
  idpf: fix handling rsc packet with a single segment
====================

Link: https://patch.msgid.link/20250211214343.4092496-1-anthony.l.nguyen@intel.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-02-12 19:53:03 -08:00
Jakub Kicinski 4e41231249 Merge branch '100GbE' of git://git.kernel.org/pub/scm/linux/kernel/git/tnguy/next-queue
Tony Nguyen says:

====================
Intel Wired LAN Driver Updates 2025-02-10 (ice, igc, e1000e)

For ice:

Karol, Jake, and Michal add PTP support for E830 devices. Karol
refactors and cleans up PTP code. Jake allows for a common
cross-timestamp implementation to be shared for all devices and
Michal adds E830 support.

Mateusz cleans up initial Flow Director rule creation to loop rather
than duplicate repeated similar calls.

For igc:

Siang adjust calls to remove need for close and open calls on loading
XDP program.

For e1000e:

Gerhard Engleder batches register writes for writing multicast table
on real-time kernels.

* '100GbE' of git://git.kernel.org/pub/scm/linux/kernel/git/tnguy/next-queue:
  e1000e: Fix real-time violations on link up
  igc: Avoid unnecessary link down event in XDP_SETUP_PROG process
  ice: refactor ice_fdir_create_dflt_rules() function
  ice: Implement PTP support for E830 devices
  ice: Refactor ice_ptp_init_tx_*
  ice: Add unified ice_capture_crosststamp
  ice: Process TSYN IRQ in a separate function
  ice: Use FIELD_PREP for timestamp values
  ice: Remove unnecessary ice_is_e8xx() functions
  ice: Don't check device type when checking GNSS presence
====================

Link: https://patch.msgid.link/20250210192352.3799673-1-anthony.l.nguyen@intel.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-02-11 19:51:16 -08:00
Bart Van Assche e589adf5b7 iavf: Fix a locking bug in an error path
If the netdev lock has been obtained, unlock it before returning.
This bug has been detected by the Clang thread-safety analyzer.

Fixes: afc664987a ("eth: iavf: extend the netdev_lock usage")
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
Link: https://patch.msgid.link/20250206175114.1974171-28-bvanassche@acm.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-02-11 18:02:04 -08:00
Song Yoong Siang 63f20f00d2 igc: Set buffer type for empty frames in igc_init_empty_frame
Set the buffer type to IGC_TX_BUFFER_TYPE_SKB for empty frame in the
igc_init_empty_frame function. This ensures that the buffer type is
correctly identified and handled during Tx ring cleanup.

Fixes: db0b124f02 ("igc: Enhance Qbv scheduling by using first flag bit")
Cc: stable@vger.kernel.org # 6.2+
Signed-off-by: Song Yoong Siang <yoong.siang.song@intel.com>
Acked-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Tested-by: Mor Bar-Gabay <morx.bar.gabay@intel.com>
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
2025-02-11 09:13:11 -08:00
Zdenek Bouska 7822dd4d6d igc: Fix HW RX timestamp when passed by ZC XDP
Fixes HW RX timestamp in the following scenario:
- AF_PACKET socket with enabled HW RX timestamps is created
- AF_XDP socket with enabled zero copy is created
- frame is forwarded to the BPF program, where the timestamp should
  still be readable (extracted by igc_xdp_rx_timestamp(), kfunc
  behind bpf_xdp_metadata_rx_timestamp())
- the frame got XDP_PASS from BPF program, redirecting to the stack
- AF_PACKET socket receives the frame with HW RX timestamp

Moves the skb timestamp setting from igc_dispatch_skb_zc() to
igc_construct_skb_zc() so that igc_construct_skb_zc() is similar to
igc_construct_skb().

This issue can also be reproduced by running:
 # tools/testing/selftests/bpf/xdp_hw_metadata enp1s0
When a frame with the wrong port 9092 (instead of 9091) is used:
 # echo -n xdp | nc -u -q1 192.168.10.9 9092
then the RX timestamp is missing and xdp_hw_metadata prints:
 skb hwtstamp is not found!

With this fix or when copy mode is used:
 # tools/testing/selftests/bpf/xdp_hw_metadata -c enp1s0
then RX timestamp is found and xdp_hw_metadata prints:
 found skb hwtstamp = 1736509937.852786132

Fixes: 069b142f58 ("igc: Add support for PTP .getcyclesx64()")
Signed-off-by: Zdenek Bouska <zdenek.bouska@siemens.com>
Acked-by: Vinicius Costa Gomes <vinicius.gomes@intel.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Reviewed-by: Florian Bezdeka <florian.bezdeka@siemens.com>
Reviewed-by: Song Yoong Siang <yoong.siang.song@intel.com>
Tested-by: Mor Bar-Gabay <morx.bar.gabay@intel.com>
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
2025-02-11 09:13:11 -08:00
Piotr Kwapulinski 61fb097f9a ixgbe: Fix possible skb NULL pointer dereference
The commit c824125cbb ("ixgbe: Fix passing 0 to ERR_PTR in
ixgbe_run_xdp()") stopped utilizing the ERR-like macros for xdp status
encoding. Propagate this logic to the ixgbe_put_rx_buffer().

The commit also relaxed the skb NULL pointer check - caught by Smatch.
Restore this check.

Fixes: c824125cbb ("ixgbe: Fix passing 0 to ERR_PTR in ixgbe_run_xdp()")
Reported-by: Dan Carpenter <dan.carpenter@linaro.org>
Closes: https://lore.kernel.org/intel-wired-lan/2c7d6c31-192a-4047-bd90-9566d0e14cc0@stanley.mountain/
Acked-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com>
Signed-off-by: Piotr Kwapulinski <piotr.kwapulinski@intel.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Tested-by: Saritha Sanigani <sarithax.sanigani@intel.com> (A Contingent Worker at Intel)
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
2025-02-11 09:13:11 -08:00
Joshua Hay 52c11d31b5 idpf: call set_real_num_queues in idpf_open
On initial driver load, alloc_etherdev_mqs is called with whatever max
queue values are provided by the control plane. However, if the driver
is loaded on a system where num_online_cpus() returns less than the max
queues, the netdev will think there are more queues than are actually
available. Only num_online_cpus() will be allocated, but
skb_get_queue_mapping(skb) could possibly return an index beyond the
range of allocated queues. Consequently, the packet is silently dropped
and it appears as if TX is broken.

Set the real number of queues during open so the netdev knows how many
queues will be allocated.

Fixes: 1c325aac10 ("idpf: configure resources for TX queues")
Signed-off-by: Joshua Hay <joshua.a.hay@intel.com>
Reviewed-by: Madhu Chittim <madhu.chittim@intel.com>
Tested-by: Samuel Salin <Samuel.salin@intel.com>
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
2025-02-11 09:13:11 -08:00
Sridhar Samudrala 2ff66c2f9e idpf: record rx queue in skb for RSC packets
Move the call to skb_record_rx_queue in idpf_rx_process_skb_fields()
so that RX queue is recorded for RSC packets too.

Fixes: 90912f9f4f ("idpf: convert header split mode to libeth + napi_build_skb()")
Signed-off-by: Sridhar Samudrala <sridhar.samudrala@intel.com>
Reviewed-by: Madhu Chittim <madhu.chittim@intel.com>
Tested-by: Samuel Salin <Samuel.salin@intel.com>
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
2025-02-11 09:13:10 -08:00
Sridhar Samudrala 69ab25a74e idpf: fix handling rsc packet with a single segment
Handle rsc packet with a single segment same as a multi
segment rsc packet so that CHECKSUM_PARTIAL is set in the
skb->ip_summed field. The current code is passing CHECKSUM_NONE
resulting in TCP GRO layer doing checksum in SW and hiding the
issue. This will fail when using dmabufs as payload buffers as
skb frag would be unreadable.

Fixes: 3a8845af66 ("idpf: add RX splitq napi poll support")
Signed-off-by: Sridhar Samudrala <sridhar.samudrala@intel.com>
Reviewed-by: Przemek Kitszel <przemyslaw.kitszel@intel.com>
Tested-by: Samuel Salin <Samuel.salin@intel.com>
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
2025-02-11 09:13:10 -08:00
Alexander Lobakin 2fc6b26ac8 ice: use generic unrolled_count() macro
ice, same as i40e, has a custom loop unrolling macros for unrolling
Tx descriptors filling on XSk xmit.
Replace ice defs with generic unrolled_count(), which is also more
convenient as it allows passing defines as its argument, not hardcoded
values, while the loop declaration will still be usual for-loop.

Signed-off-by: Alexander Lobakin <aleksander.lobakin@intel.com>
Acked-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com>
Link: https://patch.msgid.link/20250206182630.3914318-4-aleksander.lobakin@intel.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-02-10 17:54:43 -08:00
Alexander Lobakin 9144e6f404 i40e: use generic unrolled_count() macro
i40e, as well as ice, has a custom loop unrolling macro for unrolling
Tx descriptors filling on XSk xmit.
Replace i40e defs with generic unrolled_count(), which is also more
convenient as it allows passing defines as its argument, not hardcoded
values, while the loop declaration will still be a usual for-loop.

Signed-off-by: Alexander Lobakin <aleksander.lobakin@intel.com>
Acked-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com>
Link: https://patch.msgid.link/20250206182630.3914318-3-aleksander.lobakin@intel.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-02-10 17:54:43 -08:00
Gerhard Engleder 13e2297247 e1000e: Fix real-time violations on link up
Link down and up triggers update of MTA table. This update executes many
PCIe writes and a final flush. Thus, PCIe will be blocked until all
writes are flushed. As a result, DMA transfers of other targets suffer
from delay in the range of 50us. This results in timing violations on
real-time systems during link down and up of e1000e in combination with
an Intel i3-2310E Sandy Bridge CPU.

The i3-2310E is quite old. Launched 2011 by Intel but still in use as
robot controller. The exact root cause of the problem is unclear and
this situation won't change as Intel support for this CPU has ended
years ago. Our experience is that the number of posted PCIe writes needs
to be limited at least for real-time systems. With posted PCIe writes a
much higher throughput can be generated than with PCIe reads which
cannot be posted. Thus, the load on the interconnect is much higher.
Additionally, a PCIe read waits until all posted PCIe writes are done.
Therefore, the PCIe read can block the CPU for much more than 10us if a
lot of PCIe writes were posted before. Both issues are the reason why we
are limiting the number of posted PCIe writes in row in general for our
real-time systems, not only for this driver.

A flush after a low enough number of posted PCIe writes eliminates the
delay but also increases the time needed for MTA table update. The
following measurements were done on i3-2310E with e1000e for 128 MTA
table entries:

Single flush after all writes: 106us
Flush after every write:       429us
Flush after every 2nd write:   266us
Flush after every 4th write:   180us
Flush after every 8th write:   141us
Flush after every 16th write:  121us

A flush after every 8th write delays the link up by 35us and the
negative impact to DMA transfers of other targets is still tolerable.

Execute a flush after every 8th write. This prevents overloading the
interconnect with posted writes.

Signed-off-by: Gerhard Engleder <eg@keba.com>
Link: https://lore.kernel.org/netdev/f8fe665a-5e6c-4f95-b47a-2f3281aa0e6c@lunn.ch/T/
CC: Vitaly Lifshits <vitaly.lifshits@intel.com>
Reviewed-by: Przemek Kitszel <przemyslaw.kitszel@intel.com>
Tested-by: Avigail Dahan <avigailx.dahan@intel.com>
Reviewed-by: Vitaly Lifshits <vitaly.lifshits@intel.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
2025-02-10 10:43:48 -08:00
Song Yoong Siang be324b7903 igc: Avoid unnecessary link down event in XDP_SETUP_PROG process
The igc_close()/igc_open() functions are too drastic for installing a new
XDP prog because they cause undesirable link down event and device reset.

To avoid delays in Ethernet traffic, improve the XDP_SETUP_PROG process by
using the same sequence as igc_xdp_setup_pool(), which performs only the
necessary steps, as follows:
 1. stop the traffic and clean buffer
 2. stop NAPI
 3. install the XDP program
 4. resume NAPI
 5. allocate buffer and resume the traffic

This patch has been tested using the 'ip link set xdpdrv' command to attach
a simple XDP prog that always returns XDP_PASS.

Before this patch, attaching xdp program will cause ptp4l to lose sync for
few seconds, as shown in ptp4l log below:
  ptp4l[198.082]: rms    4 max    8 freq   +906 +/-   2 delay    12 +/-   0
  ptp4l[199.082]: rms    3 max    4 freq   +906 +/-   3 delay    12 +/-   0
  ptp4l[199.536]: port 1 (enp2s0): link down
  ptp4l[199.536]: port 1 (enp2s0): SLAVE to FAULTY on FAULT_DETECTED (FT_UNSPECIFIED)
  ptp4l[199.600]: selected local clock 22abbc.fffe.bb1234 as best master
  ptp4l[199.600]: port 1 (enp2s0): assuming the grand master role
  ptp4l[199.600]: port 1 (enp2s0): master state recommended in slave only mode
  ptp4l[199.600]: port 1 (enp2s0): defaultDS.priority1 probably misconfigured
  ptp4l[202.266]: port 1 (enp2s0): link up
  ptp4l[202.300]: port 1 (enp2s0): FAULTY to LISTENING on INIT_COMPLETE
  ptp4l[205.558]: port 1 (enp2s0): new foreign master 44abbc.fffe.bb2144-1
  ptp4l[207.558]: selected best master clock 44abbc.fffe.bb2144
  ptp4l[207.559]: port 1 (enp2s0): LISTENING to UNCALIBRATED on RS_SLAVE
  ptp4l[208.308]: port 1 (enp2s0): UNCALIBRATED to SLAVE on MASTER_CLOCK_SELECTED
  ptp4l[208.933]: rms  742 max 1303 freq   -195 +/- 682 delay    12 +/-   0
  ptp4l[209.933]: rms  178 max  274 freq   +387 +/- 243 delay    12 +/-   0

After this patch, attaching xdp program no longer cause ptp4l to lose sync,
as shown in ptp4l log below:
  ptp4l[201.183]: rms    1 max    3 freq   +959 +/-   1 delay     8 +/-   0
  ptp4l[202.183]: rms    1 max    3 freq   +961 +/-   2 delay     8 +/-   0
  ptp4l[203.183]: rms    2 max    3 freq   +958 +/-   2 delay     8 +/-   0
  ptp4l[204.183]: rms    3 max    5 freq   +961 +/-   3 delay     8 +/-   0
  ptp4l[205.183]: rms    2 max    4 freq   +964 +/-   3 delay     8 +/-   0

Besides, before this patch, attaching xdp program will causes flood ping to
lose 10 packets, as shown in ping statistics below:
  --- 169.254.1.2 ping statistics ---
  100000 packets transmitted, 99990 received, +6 errors, 0.01% packet loss, time 34001ms
  rtt min/avg/max/mdev = 0.028/0.301/3104.360/13.838 ms, pipe 10, ipg/ewma 0.340/0.243 ms

After this patch, attaching xdp program no longer cause flood ping to loss
any packets, as shown in ping statistics below:
  --- 169.254.1.2 ping statistics ---
  100000 packets transmitted, 100000 received, 0% packet loss, time 32326ms
  rtt min/avg/max/mdev = 0.027/0.231/19.589/0.155 ms, pipe 2, ipg/ewma 0.323/0.322 ms

On the other hand, this patch has been tested with tools/testing/selftests/
bpf/xdp_hw_metadata app to make sure AF_XDP zero-copy is working fine with
XDP Tx and Rx metadata. Below is the result of last packet after received
10000 UDP packets with interval 1 ms:
  poll: 1 (0) skip=0 fail=0 redir=10000
  xsk_ring_cons__peek: 1
  0x55881c7ef7a8: rx_desc[9999]->addr=8f110 addr=8f110 comp_addr=8f110 EoP
  rx_hash: 0xFB9BB6A3 with RSS type:0x1
  HW RX-time:   1733923136269470866 (sec:1733923136.2695) delta to User RX-time sec:0.0000 (43.280 usec)
  XDP RX-time:   1733923136269482482 (sec:1733923136.2695) delta to User RX-time sec:0.0000 (31.664 usec)
  No rx_vlan_tci or rx_vlan_proto, err=-95
  0x55881c7ef7a8: ping-pong with csum=ab19 (want 315b) csum_start=34 csum_offset=6
  0x55881c7ef7a8: complete tx idx=9999 addr=f010
  HW TX-complete-time:   1733923136269591637 (sec:1733923136.2696) delta to User TX-complete-time sec:0.0001 (108.571 usec)
  XDP RX-time:   1733923136269482482 (sec:1733923136.2695) delta to User TX-complete-time sec:0.0002 (217.726 usec)
  HW RX-time:   1733923136269470866 (sec:1733923136.2695) delta to HW TX-complete-time sec:0.0001 (120.771 usec)
  0x55881c7ef7a8: complete rx idx=10127 addr=8f110

Signed-off-by: Song Yoong Siang <yoong.siang.song@intel.com>
Tested-by: Avigail Dahan <avigailx.dahan@intel.com>
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
2025-02-10 10:43:48 -08:00
Mateusz Polchlopek 5a7b0b6ff4 ice: refactor ice_fdir_create_dflt_rules() function
The Flow Director function ice_fdir_create_dflt_rules() calls few
times function ice_create_init_fdir_rule() each time with different
enum ice_fltr_ptype parameter. Next step is to return error code if
error occurred.

Change the code to store all necessary default rules in constant array
and call ice_create_init_fdir_rule() in the loop. It makes it easy to
extend the list of default rules in the future, without the need of
duplicate code more and more.

Reviewed-by: Michal Swiatkowski <michal.swiatkowski@linux.intel.com>
Signed-off-by: Mateusz Polchlopek <mateusz.polchlopek@intel.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Tested-by: Rinitha S <sx.rinitha@intel.com> (A Contingent worker at Intel)
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
2025-02-10 10:43:48 -08:00
Michal Michalik f003075227 ice: Implement PTP support for E830 devices
Add specific functions and definitions for E830 devices to enable
PTP support.

E830 devices support direct write to GLTSYN_ registers without shadow
registers and 64 bit read of PHC time.

Enable PTM for E830 device, which is required for cross timestamp and
and dependency on PCIE_PTM for ICE_HWTS.

Check X86_FEATURE_ART for E830 as it may not be present in the CPU.

Cc: Anna-Maria Behnsen <anna-maria@linutronix.de>
Cc: Frederic Weisbecker <frederic@kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Przemek Kitszel <przemyslaw.kitszel@intel.com>
Co-developed-by: Jacob Keller <jacob.e.keller@intel.com>
Signed-off-by: Jacob Keller <jacob.e.keller@intel.com>
Co-developed-by: Milena Olech <milena.olech@intel.com>
Signed-off-by: Milena Olech <milena.olech@intel.com>
Co-developed-by: Paul Greenwalt <paul.greenwalt@intel.com>
Signed-off-by: Paul Greenwalt <paul.greenwalt@intel.com>
Signed-off-by: Michal Michalik <michal.michalik@intel.com>
Co-developed-by: Karol Kolacinski <karol.kolacinski@intel.com>
Signed-off-by: Karol Kolacinski <karol.kolacinski@intel.com>
Tested-by: Pucha Himasekhar Reddy <himasekharx.reddy.pucha@intel.com> (A Contingent worker at Intel)
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
2025-02-10 10:43:48 -08:00
Karol Kolacinski 381d577962 ice: Refactor ice_ptp_init_tx_*
Unify ice_ptp_init_tx_* functions for most of the MAC types except E82X.
This simplifies the code for the future use with new MAC types.

Reviewed-by: Przemek Kitszel <przemyslaw.kitszel@intel.com>
Signed-off-by: Karol Kolacinski <karol.kolacinski@intel.com>
Tested-by: Pucha Himasekhar Reddy <himasekharx.reddy.pucha@intel.com> (A Contingent worker at Intel)
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
2025-02-10 10:43:47 -08:00
Jacob Keller 92456e795a ice: Add unified ice_capture_crosststamp
Devices supported by ice driver use essentially the same logic for
performing a crosstimestamp. The only difference is that E830 hardware
has different offsets. Instead of having multiple implementations,
combine them into a single ice_capture_crosststamp() function.

To support both hardware types, the ice_capture_crosststamp function
must be able to determine the appropriate registers to access. To handle
this, pass a custom context structure instead of the PF pointer. This
structure, ice_crosststamp_ctx, contains a pointer to the PF, and
a pointer to the device configuration structure. This new structure also
will make it easier to implement historic snapshot support in a future
commit.

The device configuration structure is a static const data which defines
the offsets and flags for the various registers. This includes the lock
register, the cross timestamp control register, the upper and lower ART
system time capture registers, and the upper and lower device time
capture registers for each timer index.

Use the configuration structure to access all of the registers in
ice_capture_crosststamp(). Ensure that we don't over-run the device time
array by checking that the timer index is 0 or 1. Previously this was
simply assumed, and it would cause the device to read an incorrect and
likely garbage register.

It does feel like there should be a kernel interface for managing
register offsets like this, but the closest thing I saw was
<linux/regmap.h> which is interesting but not quite what we're looking
for...

Use rd32_poll_timeout() to read lock_reg and ctl_reg.

Add snapshot system time for historic interpolation.

Remove X86_FEATURE_ART and X86_FEATURE_TSC_KNOWN_FREQ from all E82X
devices because those are SoCs, which will always have those features.

Reviewed-by: Przemek Kitszel <przemyslaw.kitszel@intel.com>
Signed-off-by: Jacob Keller <jacob.e.keller@intel.com>
Signed-off-by: Karol Kolacinski <karol.kolacinski@intel.com>
Tested-by: Pucha Himasekhar Reddy <himasekharx.reddy.pucha@intel.com> (A Contingent worker at Intel)
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
2025-02-10 10:43:47 -08:00
Karol Kolacinski f9472aaabd ice: Process TSYN IRQ in a separate function
Simplify TSYN IRQ processing by moving it to a separate function and
having appropriate behavior per PHY model, instead of multiple
conditions not related to HW, but to specific timestamping modes.

When PTP is not enabled in the kernel, don't process timestamps and
return IRQ_HANDLED.

Reviewed-by: Przemek Kitszel <przemyslaw.kitszel@intel.com>
Signed-off-by: Karol Kolacinski <karol.kolacinski@intel.com>
Tested-by: Pucha Himasekhar Reddy <himasekharx.reddy.pucha@intel.com> (A Contingent worker at Intel)
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
2025-02-10 10:43:47 -08:00
Karol Kolacinski ea7029fe10 ice: Use FIELD_PREP for timestamp values
Instead of using shifts and casts, use FIELD_PREP after reading 40b
timestamp values.

Rename a couple defines for better clarity and consistency.

Reviewed-by: Simon Horman <horms@kernel.org>
Signed-off-by: Karol Kolacinski <karol.kolacinski@intel.com>
Tested-by: Pucha Himasekhar Reddy <himasekharx.reddy.pucha@intel.com> (A Contingent worker at Intel)
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
2025-02-10 10:43:47 -08:00
Karol Kolacinski 9973ac9f23 ice: Remove unnecessary ice_is_e8xx() functions
Remove unnecessary ice_is_e8xx() functions and PHY model. Instead, use
MAC type where applicable.

Don't check device type in ice_ptp_maybe_trigger_tx_interrupt(), because
in reality it depends on the ready bitmap, which only E810 does not
have.

Call ice_ptp_cfg_phy_interrupt() unconditionally, because all further
function calls check the MAC type anyway and this allows simpler code
in the future with addition of the new MAC types.

Reorder ICE_MAC_* cases in switches in ice_ptp* as in enum ice_mac_type.

Signed-off-by: Karol Kolacinski <karol.kolacinski@intel.com>
Tested-by: Pucha Himasekhar Reddy <himasekharx.reddy.pucha@intel.com> (A Contingent worker at Intel)
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
2025-02-10 10:43:36 -08:00
Karol Kolacinski e2c6737e6e ice: Don't check device type when checking GNSS presence
Don't check if the device type is E810T as non-E810T devices can support
GNSS too and PCA9575 check is enough to determine if GNSS is present or
not.

Rename ice_gnss_is_gps_present() to ice_gnss_is_module_present()
because GNSS module supports multiple GNSS providers, not only GPS.

Move functions related to PCA9575 from ice_ptp_hw.c to ice_common.c
to be able to access them when PTP is disabled in the kernel, but GNSS
is enabled.

Remove logical AND with ICE_AQC_LINK_TOPO_NODE_TYPE_M in
ice_get_pca9575_handle(), which has no effect, and reorder device type
checks to check the device_id first, then set other variables.

Signed-off-by: Karol Kolacinski <karol.kolacinski@intel.com>
Tested-by: Pucha Himasekhar Reddy <himasekharx.reddy.pucha@intel.com> (A Contingent worker at Intel)
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
2025-02-10 08:52:04 -08:00
Jakub Kicinski 26db4dbb74 Merge branch 'for-next' of git://git.kernel.org/pub/scm/linux/kernel/git/tnguy/next-queue
Tony Nguyen says:

====================
ice: managing MSI-X in driver

Michal Swiatkowski says:

It is another try to allow user to manage amount of MSI-X used for each
feature in ice. First was via devlink resources API, it wasn't accepted
in upstream. Also static MSI-X allocation using devlink resources isn't
really user friendly.

This try is using more dynamic way. "Dynamic" across whole kernel when
platform supports it and "dynamic" across the driver when not.

To achieve that reuse global devlink parameter pf_msix_max and
pf_msix_min. It fits how ice hardware counts MSI-X. In case of ice amount
of MSI-X reported on PCI is a whole MSI-X for the card (with MSI-X for
VFs also). Having pf_msix_max allow user to statically set how many
MSI-X he wants on PF and how many should be reserved for VFs.

pf_msix_min is used to set minimum number of MSI-X with which ice driver
should probe correctly.

Meaning of this field in case of dynamic vs static allocation:
- on system with dynamic MSI-X allocation support
 * alloc pf_msix_min as static, rest will be allocated dynamically
- on system without dynamic MSI-X allocation support
 * try alloc pf_msix_max as static, minimum acceptable result is
 pf_msix_min

As Jesse and Piotr suggested pf_msix_max and pf_msix_min can (an
probably should) be stored in NVM. This patchset isn't implementing
that.

Dynamic (kernel or driver) way means that splitting MSI-X across the
RDMA and eth in case there is a MSI-X shortage isn't correct. Can work
when dynamic is only on driver site, but can't when dynamic is on kernel
site.

Let's remove this code and move to MSI-X allocation feature by feature.
If there is no more MSI-X for a feature, a feature is working with less
MSI-X or it is turned off.

There is a regression here. With MSI-X splitting user can run RDMA and
eth even on system with not enough MSI-X. Now only eth will work. RDMA
can be turned on by changing number of PF queues (lowering) and reprobe
RDMA driver.

Example:
72 CPU number, eth, RDMA and flow director (1 MSI-X), 1 MSI-X for OICR
on PF, and 1 more for RDMA. Card is using 1 + 72 + 1 + 72 + 1 = 147.

We set pf_msix_min = 2, pf_msix_max = 128

OICR: 1
eth: 72
flow director: 1
RDMA: 128 - 74 = 54

We can change number of queues on pf to 36 and do devlink reinit

OICR: 1
eth: 36
RDMA: 73
flow director: 1

We can also (implemented in "ice: enable_rdma devlink param") turned
RDMA off.

OICR: 1
eth: 72
RDMA: 0 (turned off)
flow director: 1

After this changes we have a static base vector for SRIOV (SIOV probably
in the feature). Last patch from this series is simplifying managing VF
MSI-X code based on static vector.

Now changing queues using ethtool is also changing MSI-X. If there is
enough MSI-X it is always one to one. When there is not enough there
will be more queues than MSI-X.

* 'for-next' of git://git.kernel.org/pub/scm/linux/kernel/git/tnguy/next-queue:
  ice: init flow director before RDMA
  ice: simplify VF MSI-X managing
  ice: enable_rdma devlink param
  ice: treat dyn_allowed only as suggestion
  ice, irdma: move interrupts code to irdma
  ice: get rid of num_lan_msix field
  ice: remove splitting MSI-X between features
  ice: devlink PF MSI-X max and min parameter
  ice: count combined queues using Rx/Tx count
====================

Link: https://patch.msgid.link/20250205185512.895887-1-anthony.l.nguyen@intel.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-02-06 17:34:36 -08:00
Michal Swiatkowski d67627e7b5 ice: init flow director before RDMA
Flow director needs only one MSI-X. Load it before RDMA to save MSI-X
for it.

Reviewed-by: Jacob Keller <jacob.e.keller@intel.com>
Tested-by: Pucha Himasekhar Reddy <himasekharx.reddy.pucha@intel.com>
Signed-off-by: Michal Swiatkowski <michal.swiatkowski@linux.intel.com>
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
2025-02-05 09:04:58 -08:00
Michal Swiatkowski a203163274 ice: simplify VF MSI-X managing
After implementing pf->msix.max field, base vector for other use cases
(like VFs) can be fixed. This simplify code when changing MSI-X amount
on particular VF, because there is no need to move a base vector.

A fixed base vector allows to reserve vectors from the beginning
instead of from the end, which is also simpler in code.

Store total and rest value in the same struct as max and min for PF.
Move tracking vectors from ice_sriov.c to ice_irq.c as it can be also
use for other none PF use cases (SIOV).

Tested-by: Rafal Romanowski <rafal.romanowski@intel.com>
Signed-off-by: Michal Swiatkowski <michal.swiatkowski@linux.intel.com>
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
2025-02-05 09:04:57 -08:00
Michal Swiatkowski 87181cd698 ice: enable_rdma devlink param
Implement enable_rdma devlink parameter to allow user to turn RDMA
feature on and off.

It is useful when there is no enough interrupts and user doesn't need
RDMA feature.

Reviewed-by: Jacob Keller <jacob.e.keller@intel.com>
Reviewed-by: Jan Sokolowski <jan.sokolowski@intel.com>
Reviewed-by: Przemek Kitszel <przemyslaw.kitszel@intel.com>
Signed-off-by: Michal Swiatkowski <michal.swiatkowski@linux.intel.com>
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
2025-02-05 09:04:57 -08:00
Michal Swiatkowski a8c2d3932c ice: treat dyn_allowed only as suggestion
It can be needed to have some MSI-X allocated as static and rest as
dynamic. For example on PF VSI. We want to always have minimum one MSI-X
on it, because of that it is allocated as a static one, rest can be
dynamic if it is supported.

Change the ice_get_irq_res() to allow using static entries if they are
free even if caller wants dynamic one.

Adjust limit values to the new approach. Min and max in limit means the
values that are valid, so decrease max and num_static by one.

Set vsi::irq_dyn_alloc if dynamic allocation is supported.

Reviewed-by: Jacob Keller <jacob.e.keller@intel.com>
Reviewed-by: Wojciech Drewek <wojciech.drewek@intel.com>
Tested-by: Pucha Himasekhar Reddy <himasekharx.reddy.pucha@intel.com>
Signed-off-by: Michal Swiatkowski <michal.swiatkowski@linux.intel.com>
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
2025-02-05 09:04:57 -08:00
Michal Swiatkowski 3e0d3cb3fb ice, irdma: move interrupts code to irdma
Move responsibility of MSI-X requesting for RDMA feature from ice driver
to irdma driver. It is done to allow simple fallback when there is not
enough MSI-X available.

Change amount of MSI-X used for control from 4 to 1, as it isn't needed
to have more than one MSI-X for this purpose.

Reviewed-by: Jacob Keller <jacob.e.keller@intel.com>
Signed-off-by: Michal Swiatkowski <michal.swiatkowski@linux.intel.com>
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
2025-02-05 09:04:57 -08:00
Michal Swiatkowski ad61cd9c67 ice: get rid of num_lan_msix field
Remove the field to allow having more queues than MSI-X on VSI. As
default the number will be the same, but if there won't be more MSI-X
available VSI can run with at least one MSI-X.

Reviewed-by: Jacob Keller <jacob.e.keller@intel.com>
Reviewed-by: Wojciech Drewek <wojciech.drewek@intel.com>
Tested-by: Pucha Himasekhar Reddy <himasekharx.reddy.pucha@intel.com>
Signed-off-by: Michal Swiatkowski <michal.swiatkowski@linux.intel.com>
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
2025-02-05 09:04:57 -08:00
Michal Swiatkowski 79d97b8cf9 ice: remove splitting MSI-X between features
With dynamic approach to alloc MSI-X there is no sense to statically
split MSI-X between PF features.

Splitting was also calculating needed MSI-X. Move this part to separate
function and use as max value.

Remove ICE_ESWITCH_MSIX, as there is no need for additional MSI-X for
switchdev.

Reviewed-by: Jacob Keller <jacob.e.keller@intel.com>
Reviewed-by: Wojciech Drewek <wojciech.drewek@intel.com>
Tested-by: Pucha Himasekhar Reddy <himasekharx.reddy.pucha@intel.com>
Signed-off-by: Michal Swiatkowski <michal.swiatkowski@linux.intel.com>
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
2025-02-05 09:04:57 -08:00
Michal Swiatkowski b2657259fc ice: devlink PF MSI-X max and min parameter
Use generic devlink PF MSI-X parameter to allow user to change MSI-X
range.

Add notes about this parameters into ice devlink documentation.

Tested-by: Pucha Himasekhar Reddy <himasekharx.reddy.pucha@intel.com>
Signed-off-by: Michal Swiatkowski <michal.swiatkowski@linux.intel.com>
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
2025-02-05 09:02:44 -08:00
Michal Swiatkowski c3a392bdd3 ice: count combined queues using Rx/Tx count
Previous implementation assumes that there is 1:1 matching between
vectors and queues. It isn't always true.

Get minimum value from Rx/Tx queues to determine combined queues number.

Reviewed-by: Jacob Keller <jacob.e.keller@intel.com>
Tested-by: Pucha Himasekhar Reddy <himasekharx.reddy.pucha@intel.com>
Signed-off-by: Michal Swiatkowski <michal.swiatkowski@linux.intel.com>
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
2025-02-04 15:08:57 -08:00
Jakub Kicinski 88be092224 Merge branch '100GbE' of git://git.kernel.org/pub/scm/linux/kernel/git/tnguy/net-queue
Tony Nguyen says:

====================
ice: fix Rx data path for heavy 9k MTU traffic

Maciej Fijalkowski says:

This patchset fixes a pretty nasty issue that was reported by RedHat
folks which occurred after ~30 minutes (this value varied, just trying
here to state that it was not observed immediately but rather after a
considerable longer amount of time) when ice driver was tortured with
jumbo frames via mix of iperf traffic executed simultaneously with
wrk/nginx on client/server sides (HTTP and TCP workloads basically).

The reported splats were spanning across all the bad things that can
happen to the state of page - refcount underflow, use-after-free, etc.
One of these looked as follows:

[ 2084.019891] BUG: Bad page state in process swapper/34  pfn:97fcd0
[ 2084.025990] page:00000000a60ee772 refcount:-1 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0x97fcd0
[ 2084.035462] flags: 0x17ffffc0000000(node=0|zone=2|lastcpupid=0x1fffff)
[ 2084.041990] raw: 0017ffffc0000000 dead000000000100 dead000000000122 0000000000000000
[ 2084.049730] raw: 0000000000000000 0000000000000000 ffffffffffffffff 0000000000000000
[ 2084.057468] page dumped because: nonzero _refcount
[ 2084.062260] Modules linked in: bonding tls sunrpc intel_rapl_msr intel_rapl_common intel_uncore_frequency intel_uncore_frequency_common i10nm_edac nfit libnvdimm x86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel kvm mgag200 irqd
[ 2084.137829] CPU: 34 PID: 0 Comm: swapper/34 Kdump: loaded Not tainted 5.14.0-427.37.1.el9_4.x86_64 #1
[ 2084.147039] Hardware name: Dell Inc. PowerEdge R750/0216NK, BIOS 1.13.2 12/19/2023
[ 2084.154604] Call Trace:
[ 2084.157058]  <IRQ>
[ 2084.159080]  dump_stack_lvl+0x34/0x48
[ 2084.162752]  bad_page.cold+0x63/0x94
[ 2084.166333]  check_new_pages+0xb3/0xe0
[ 2084.170083]  rmqueue_bulk+0x2d2/0x9e0
[ 2084.173749]  ? ktime_get+0x35/0xa0
[ 2084.177159]  rmqueue_pcplist+0x13b/0x210
[ 2084.181081]  rmqueue+0x7d3/0xd40
[ 2084.184316]  ? xas_load+0x9/0xa0
[ 2084.187547]  ? xas_find+0x183/0x1d0
[ 2084.191041]  ? xa_find_after+0xd0/0x130
[ 2084.194879]  ? intel_iommu_iotlb_sync_map+0x89/0xe0
[ 2084.199759]  get_page_from_freelist+0x11f/0x530
[ 2084.204291]  __alloc_pages+0xf2/0x250
[ 2084.207958]  ice_alloc_rx_bufs+0xcc/0x1c0 [ice]
[ 2084.212543]  ice_clean_rx_irq+0x631/0xa20 [ice]
[ 2084.217111]  ice_napi_poll+0xdf/0x2a0 [ice]
[ 2084.221330]  __napi_poll+0x27/0x170
[ 2084.224824]  net_rx_action+0x233/0x2f0
[ 2084.228575]  __do_softirq+0xc7/0x2ac
[ 2084.232155]  __irq_exit_rcu+0xa1/0xc0
[ 2084.235821]  common_interrupt+0x80/0xa0
[ 2084.239662]  </IRQ>
[ 2084.241768]  <TASK>

The fix is mostly about reverting what was done in commit 1dc1a7e7f4
("ice: Centrallize Rx buffer recycling") followed by proper timing on
page_count() storage and then removing the ice_rx_buf::act related logic
(which was mostly introduced for purposes from cited commit).

Special thanks to Xu Du for providing reproducer and Jacob Keller for
initial extensive analysis.

* '100GbE' of git://git.kernel.org/pub/scm/linux/kernel/git/tnguy/net-queue:
  ice: stop storing XDP verdict within ice_rx_buf
  ice: gather page_count()'s of each frag right before XDP prog call
  ice: put Rx buffers after being done with current frame
====================

Link: https://patch.msgid.link/20250131185415.3741532-1-anthony.l.nguyen@intel.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-02-03 13:18:44 -08:00
Jiasheng Jiang a8aa6a6ddc ice: Add check for devm_kzalloc()
Add check for the return value of devm_kzalloc() to guarantee the success
of allocation.

Fixes: 42c2eb6b1f ("ice: Implement devlink-rate API")
Signed-off-by: Jiasheng Jiang <jiashengjiangcool@gmail.com>
Reviewed-by: Michal Swiatkowski <michal.swiatkowski@linux.intel.com>
Link: https://patch.msgid.link/20250131013832.24805-1-jiashengjiangcool@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-02-01 16:57:50 -08:00
Maciej Fijalkowski 468a1952df ice: stop storing XDP verdict within ice_rx_buf
Idea behind having ice_rx_buf::act was to simplify and speed up the Rx
data path by walking through buffers that were representing cleaned HW
Rx descriptors. Since it caused us a major headache recently and we
rolled back to old approach that 'puts' Rx buffers right after running
XDP prog/creating skb, this is useless now and should be removed.

Get rid of ice_rx_buf::act and related logic. We still need to take care
of a corner case where XDP program releases a particular fragment.

Make ice_run_xdp() to return its result and use it within
ice_put_rx_mbuf().

Fixes: 2fba7dc515 ("ice: Add support for XDP multi-buffer on Rx side")
Reviewed-by: Przemek Kitszel <przemyslaw.kitszel@intel.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Signed-off-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com>
Tested-by: Chandan Kumar Rout <chandanx.rout@intel.com> (A Contingent Worker at Intel)
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
2025-01-31 10:07:46 -08:00
Maciej Fijalkowski 11c4aa074d ice: gather page_count()'s of each frag right before XDP prog call
If we store the pgcnt on few fragments while being in the middle of
gathering the whole frame and we stumbled upon DD bit not being set, we
terminate the NAPI Rx processing loop and come back later on. Then on
next NAPI execution we work on previously stored pgcnt.

Imagine that second half of page was used actively by networking stack
and by the time we came back, stack is not busy with this page anymore
and decremented the refcnt. The page reuse algorithm in this case should
be good to reuse the page but given the old refcnt it will not do so and
attempt to release the page via page_frag_cache_drain() with
pagecnt_bias used as an arg. This in turn will result in negative refcnt
on struct page, which was initially observed by Xu Du.

Therefore, move the page count storage from ice_get_rx_buf() to a place
where we are sure that whole frame has been collected, but before
calling XDP program as it internally can also change the page count of
fragments belonging to xdp_buff.

Fixes: ac07533911 ("ice: Store page count inside ice_rx_buf")
Reported-and-tested-by: Xu Du <xudu@redhat.com>
Reviewed-by: Przemek Kitszel <przemyslaw.kitszel@intel.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Co-developed-by: Jacob Keller <jacob.e.keller@intel.com>
Signed-off-by: Jacob Keller <jacob.e.keller@intel.com>
Signed-off-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com>
Tested-by: Chandan Kumar Rout <chandanx.rout@intel.com> (A Contingent Worker at Intel)
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
2025-01-31 10:06:26 -08:00
Maciej Fijalkowski 743bbd93cf ice: put Rx buffers after being done with current frame
Introduce a new helper ice_put_rx_mbuf() that will go through gathered
frags from current frame and will call ice_put_rx_buf() on them. Current
logic that was supposed to simplify and optimize the driver where we go
through a batch of all buffers processed in current NAPI instance turned
out to be broken for jumbo frames and very heavy load that was coming
from both multi-thread iperf and nginx/wrk pair between server and
client. The delay introduced by approach that we are dropping is simply
too big and we need to take the decision regarding page
recycling/releasing as quick as we can.

While at it, address an error path of ice_add_xdp_frag() - we were
missing buffer putting from day 1 there.

As a nice side effect we get rid of annoying and repetitive three-liner:

	xdp->data = NULL;
	rx_ring->first_desc = ntc;
	rx_ring->nr_frags = 0;

by embedding it within introduced routine.

Fixes: 1dc1a7e7f4 ("ice: Centrallize Rx buffer recycling")
Reported-and-tested-by: Xu Du <xudu@redhat.com>
Reviewed-by: Przemek Kitszel <przemyslaw.kitszel@intel.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Co-developed-by: Jacob Keller <jacob.e.keller@intel.com>
Signed-off-by: Jacob Keller <jacob.e.keller@intel.com>
Signed-off-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com>
Tested-by: Chandan Kumar Rout <chandanx.rout@intel.com> (A Contingent Worker at Intel)
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
2025-01-31 09:51:47 -08:00
Michal Swiatkowski ee7d79433d iavf: allow changing VLAN state without calling PF
First case:
> ip l a l $VF name vlanx type vlan id 100
> ip l d vlanx
> ip l a l $VF name vlanx type vlan id 100

As workqueue can be execute after sometime, there is a window to have
call trace like that:
- iavf_del_vlan
- iavf_add_vlan
- iavf_del_vlans (wq)

It means that our VLAN 100 will change the state from IAVF_VLAN_ACTIVE
to IAVF_VLAN_REMOVE (iavf_del_vlan). After that in iavf_add_vlan state
won't be changed because VLAN 100 is on the filter list. The final
result is that the VLAN 100 filter isn't added in hardware (no
iavf_add_vlans call).

To fix that change the state if the filter wasn't removed yet directly
to active. It is save as IAVF_VLAN_REMOVE means that virtchnl message
wasn't sent yet.

Second case:
> ip l a l $VF name vlanx type vlan id 100
Any type of VF reset ex. change trust
> ip l s $PF vf $VF_NUM trust on
> ip l d vlanx
> ip l a l $VF name vlanx type vlan id 100

In case of reset iavf driver is responsible for readding all filters
that are being used. To do that all VLAN filters state are changed to
IAVF_VLAN_ADD. Here is even longer window for changing VLAN state from
kernel side, as workqueue isn't called immediately. We can have call
trace like that:

- changing to IAVF_VLAN_ADD (after reset)
- iavf_del_vlan (called from kernel ops)
- iavf_del_vlans (wq)

Not exsisitng VLAN filters will be removed from hardware. It isn't a
bug, ice driver will handle it fine. However, we can have call trace
like that:

- changing to IAVF_VLAN_ADD (after reset)
- iavf_del_vlan (called from kernel ops)
- iavf_add_vlan (called from kernel ops)
- iavf_del_vlans (wq)

With fix for previous case we end up with no VLAN filters in hardware.
We have to remove VLAN filters if the state is IAVF_VLAN_ADD and delete
VLAN was called. It is save as IAVF_VLAN_ADD means that virtchnl message
wasn't sent yet.

Fixes: 0c0da0e951 ("iavf: refactor VLAN filter states")
Signed-off-by: Michal Swiatkowski <michal.swiatkowski@linux.intel.com>
Reviewed-by: Przemek Kitszel <przemyslaw.kitszel@intel.com>
Tested-by: Rafal Romanowski <rafal.romanowski@intel.com>
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
2025-01-24 10:52:09 -08:00
Mateusz Polchlopek c5cc2a27e0 ice: remove invalid parameter of equalizer
It occurred that in the commit 70838938e8 ("ice: Implement driver
functionality to dump serdes equalizer values") the invalid DRATE parameter
for reading has been added. The output of the command:

  $ ethtool -d <ethX>

returns the garbage value in the place where DRATE value should be
stored.

Remove mentioned parameter to prevent return of corrupted data to
userspace.

Fixes: 70838938e8 ("ice: Implement driver functionality to dump serdes equalizer values")
Signed-off-by: Mateusz Polchlopek <mateusz.polchlopek@intel.com>
Reviewed-by: Michal Swiatkowski <michal.swiatkowski@linux.intel.com>
Tested-by: Rinitha S <sx.rinitha@intel.com> (A Contingent worker at Intel)
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
2025-01-24 10:49:42 -08:00
Przemek Kitszel 18625e26fe ice: fix ice_parser_rt::bst_key array size
Fix &ice_parser_rt::bst_key size. It was wrongly set to 10 instead of 20
in the initial impl commit (see Fixes tag). All usage code assumed it was
of size 20. That was also the initial size present up to v2 of the intro
series [2], but halved by v3 [3] refactor described as "Replace magic
hardcoded values with macros." The introducing series was so big that
some ugliness was unnoticed, same for bugs :/

ICE_BST_KEY_TCAM_SIZE and ICE_BST_TCAM_KEY_SIZE were differing by one.
There was tmp variable @j in the scope of edited function, but was not
used in all places. This ugliness is now gone.
I'm moving ice_parser_rt::pg_prio a few positions up, to fill up one of
the holes in order to compensate for the added 10 bytes to the ::bst_key,
resulting in the same size of the whole as prior to the fix, and minimal
changes in the offsets of the fields.

Extend also the debug dump print of the key to cover all bytes. To not
have string with 20 "%02x" and 20 params, switch to
ice_debug_array_w_prefix().

This fix obsoletes Ahmed's attempt at [1].

[1] https://lore.kernel.org/intel-wired-lan/20240823230847.172295-1-ahmed.zaki@intel.com
[2] https://lore.kernel.org/intel-wired-lan/20230605054641.2865142-13-junfeng.guo@intel.com
[3] https://lore.kernel.org/intel-wired-lan/20230817093442.2576997-13-junfeng.guo@intel.com

Reported-by: Dan Carpenter <dan.carpenter@linaro.org>
Closes: https://lore.kernel.org/intel-wired-lan/b1fb6ff9-b69e-4026-9988-3c783d86c2e0@stanley.mountain
Fixes: 9a4c07aaa0 ("ice: add parser execution main loop")
CC: Ahmed Zaki <ahmed.zaki@intel.com>
Reviewed-by: Larysa Zaremba <larysa.zaremba@intel.com>
Signed-off-by: Przemek Kitszel <przemyslaw.kitszel@intel.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Tested-by: Rafal Romanowski <rafal.romanowski@intel.com>
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
2025-01-24 10:49:30 -08:00
Manoj Vishwanathan d0ea9ebac3 idpf: add more info during virtchnl transaction timeout/salt mismatch
Add more information related to the transaction like cookie, vc_op,
salt when transaction times out and include similar information
when transaction salt does not match.

Info output for transaction timeout:
-------------------
(op:5015 cookie:45fe vc_op:5015 salt:45 timeout:60000ms)
-------------------

before it was:

-------------------
(op 5015, 60000ms)
-------------------

Signed-off-by: Manoj Vishwanathan <manojvishy@google.com>
Signed-off-by: Brian Vazquez <brianvv@google.com>
Reviewed-by: Jacob Keller <jacob.e.keller@intel.com>
Reviewed-by: Pavan Kumar Linga <pavan.kumar.linga@intel.com>
Reviewed-by: Paul Menzel <pmenzel@molgen.mpg.de>
Tested-by: Krishneil Singh <krishneil.k.singh@intel.com>
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
2025-01-24 10:03:39 -08:00
Marco Leogrande 9a5b021cb8 idpf: convert workqueues to unbound
When a workqueue is created with `WQ_UNBOUND`, its work items are
served by special worker-pools, whose host workers are not bound to
any specific CPU. In the default configuration (i.e. when
`queue_delayed_work` and friends do not specify which CPU to run the
work item on), `WQ_UNBOUND` allows the work item to be executed on any
CPU in the same node of the CPU it was enqueued on. While this
solution potentially sacrifices locality, it avoids contention with
other processes that might dominate the CPU time of the processor the
work item was scheduled on.

This is not just a theoretical problem: in a particular scenario
misconfigured process was hogging most of the time from CPU0, leaving
less than 0.5% of its CPU time to the kworker. The IDPF workqueues
that were using the kworker on CPU0 suffered large completion delays
as a result, causing performance degradation, timeouts and eventual
system crash.

Tested:

* I have also run a manual test to gauge the performance
  improvement. The test consists of an antagonist process
  (`./stress --cpu 2`) consuming as much of CPU 0 as possible. This
  process is run under `taskset 01` to bind it to CPU0, and its
  priority is changed with `chrt -pQ 9900 10000 ${pid}` and
  `renice -n -20 ${pid}` after start.

  Then, the IDPF driver is forced to prefer CPU0 by editing all calls
  to `queue_delayed_work`, `mod_delayed_work`, etc... to use CPU 0.

  Finally, `ktraces` for the workqueue events are collected.

  Without the current patch, the antagonist process can force
  arbitrary delays between `workqueue_queue_work` and
  `workqueue_execute_start`, that in my tests were as high as
  `30ms`. With the current patch applied, the workqueue can be
  migrated to another unloaded CPU in the same node, and, keeping
  everything else equal, the maximum delay I could see was `6us`.

Fixes: 0fe45467a1 ("idpf: add create vport and netdev configuration")
Signed-off-by: Marco Leogrande <leogrande@google.com>
Signed-off-by: Manoj Vishwanathan <manojvishy@google.com>
Signed-off-by: Brian Vazquez <brianvv@google.com>
Reviewed-by: Jacob Keller <jacob.e.keller@intel.com>
Reviewed-by: Pavan Kumar Linga <pavan.kumar.linga@intel.com>
Tested-by: Krishneil Singh <krishneil.k.singh@intel.com>
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
2025-01-24 10:03:39 -08:00
Manoj Vishwanathan d15fe4edd7 idpf: Acquire the lock before accessing the xn->salt
The transaction salt was being accessed before acquiring the
idpf_vc_xn_lock when idpf has to forward the virtchnl reply.

Fixes: 34c21fa894 ("idpf: implement virtchnl transaction manager")
Signed-off-by: Manoj Vishwanathan <manojvishy@google.com>
Signed-off-by: David Decotigny <decot@google.com>
Signed-off-by: Brian Vazquez <brianvv@google.com>
Reviewed-by: Jacob Keller <jacob.e.keller@intel.com>
Reviewed-by: Pavan Kumar Linga <pavan.kumar.linga@intel.com>
Tested-by: Krishneil Singh <krishneil.k.singh@intel.com>
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
2025-01-24 10:03:39 -08:00
Emil Tantilov 137da75ba7 idpf: fix transaction timeouts on reset
Restore the call to idpf_vc_xn_shutdown() at the beginning of
idpf_vc_core_deinit() provided the function is not called on remove.
In the reset path the mailbox is destroyed, leading to all transactions
timing out.

Fixes: 09d0fb5cb3 ("idpf: deinit virtchnl transaction manager after vport and vectors")
Reviewed-by: Larysa Zaremba <larysa.zaremba@intel.com>
Signed-off-by: Emil Tantilov <emil.s.tantilov@intel.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Tested-by: Krishneil Singh <krishneil.k.singh@intel.com>
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
2025-01-24 10:03:39 -08:00
Emil Tantilov 396f016567 idpf: add read memory barrier when checking descriptor done bit
Add read memory barrier to ensure the order of operations when accessing
control queue descriptors. Specifically, we want to avoid cases where loads
can be reordered:

1. Load #1 is dispatched to read descriptor flags.
2. Load #2 is dispatched to read some other field from the descriptor.
3. Load #2 completes, accessing memory/cache at a point in time when the DD
   flag is zero.
4. NIC DMA overwrites the descriptor, now the DD flag is one.
5. Any fields loaded before step 4 are now inconsistent with the actual
   descriptor state.

Add read memory barrier between steps 1 and 2, so that load #2 is not
executed until load #1 has completed.

Fixes: 8077c72756 ("idpf: add controlq init and reset checks")
Reviewed-by: Przemek Kitszel <przemyslaw.kitszel@intel.com>
Reviewed-by: Sridhar Samudrala <sridhar.samudrala@intel.com>
Suggested-by: Lance Richardson <rlance@google.com>
Signed-off-by: Emil Tantilov <emil.s.tantilov@intel.com>
Tested-by: Krishneil Singh <krishneil.k.singh@intel.com>
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
2025-01-24 10:03:39 -08:00
Linus Torvalds 0ad9617c78 Networking changes for 6.14.
Core
 ----
 
  - More core refactoring to reduce the RTNL lock contention,
    including preparatory work for the per-network namespace RTNL lock,
    replacing RTNL lock with a per device-one to protect NAPI-related
    net device data and moving synchronize_net() calls outside such
    lock.
 
  - Extend drop reasons usage, adding net scheduler, AF_UNIX, bridge and
    more specific TCP coverage.
 
  - Reduce network namespace tear-down time by removing per-subsystems
    synchronize_net() in tipc and sched.
 
  - Add flow label selector support for fib rules, allowing traffic
    redirection based on such header field.
 
 Netfilter
 ---------
 
  - Do not remove netdev basechain when last device is gone, allowing
    netdev basechains without devices.
 
  - Revisit the flowtable teardown strategy, dealing better with fin,
    reset and re-open events.
 
  - Scale-up IP-vs connection dumping by avoiding linear search on
    each restart.
 
 Protocols
 ---------
 
  - A significant XDP socket refactor, consolidating and optimizing
    several helpers into the core
 
  - Better scaling of ICMP rate-limiting, by removing false-sharing in
    inet peers handling.
 
  - Introduces netlink notifications for multicast IPv4 and IPv6
    address changes.
 
  - Add ipsec support for IP-TFS/AggFrag encapsulation, allowing
    aggregation and fragmentation of the inner IP.
 
  - Add sysctl to configure TIME-WAIT reuse delay for TCP sockets,
    to avoid local port exhaustion issues when the average connection
    lifetime is very short.
 
  - Support updating keys (re-keying) for connections using kernel
    TLS (for TLS 1.3 only).
 
  - Support ipv4-mapped ipv6 address clients in smc-r v2.
 
  - Add support for jumbo data packet transmission in RxRPC sockets,
    gluing multiple data packets in a single UDP packet.
 
  - Support RxRPC RACK-TLP to manage packet loss and retransmission in
    conjunction with the congestion control algorithm.
 
 Driver API
 ----------
 
  - Introduce a unified and structured interface for reporting PHY
    statistics, exposing consistent data across different H/W via
    ethtool.
 
  - Make timestamping selectable, allow the user to select the desired
    hwtstamp provider (PHY or MAC) administratively.
 
  - Add support for configuring a header-data-split threshold (HDS)
    value via ethtool, to deal with partial or buggy H/W implementation.
 
  - Consolidate DSA drivers Energy Efficiency Ethernet support.
 
  - Add EEE management to phylink, making use of the phylib
    implementation.
 
  - Add phylib support for in-band capabilities negotiation.
 
  - Simplify how phylib-enabled mac drivers expose the supported
    interfaces.
 
 Tests and tooling
 -----------------
 
  - Make the YNL tool package-friendly to make it easier to deploy it
    separately from the kernel.
 
  - Increase TCP selftest coverage importing several packetdrill
    test-cases.
 
  - Regenerate the ethtool uapi header from the YNL spec,
    to ease maintenance and future development.
 
  - Add YNL support for decoding the link types used in net
    self-tests, allowing a single build to run both net and
    drivers/net.
 
 Drivers
 -------
 
  - Ethernet high-speed NICs:
    - nVidia/Mellanox (mlx5):
      - add cross E-Switch QoS support
      - add SW Steering support for ConnectX-8
      - implement support for HW-Managed Flow Steering, improving the
        rule deletion/insertion rate
      - support for multi-host LAG
    - Intel (ixgbe, ice, igb):
      - ice: add support for devlink health events
      - ixgbe: add initial support for E610 chipset variant
      - igb: add support for AF_XDP zero-copy
    - Meta:
      - add support for basic RSS config
      - allow changing the number of channels
      - add hardware monitoring support
    - Broadcom (bnxt):
      - implement TCP data split and HDS threshold ethtool support,
        enabling Device Memory TCP.
    - Marvell Octeon:
      - implement egress ipsec offload support for the cn10k family
    - Hisilicon (HIBMC):
      - implement unicast MAC filtering
 
  - Ethernet NICs embedded and virtual:
    - Convert UDP tunnel drivers to NETDEV_PCPU_STAT_DSTATS, avoiding
      contented atomic operations for drop counters
    - Freescale:
      - quicc: phylink conversion
      - enetc: support Tx and Rx checksum offload and improve TSO
        performances
    - MediaTek:
      - airoha: introduce support for ETS and HTB Qdisc offload
    - Microchip:
      - lan78XX USB: preparation work for phylink conversion
    - Synopsys (stmmac):
      - support DWMAC IP on NXP Automotive SoCs S32G2xx/S32G3xx/S32R45
      - refactor EEE support to leverage the new driver API
      - optimize DMA and cache access to increase raw RX performances
        by 40%
    - TI:
      - icssg-prueth: add multicast filtering support for VLAN
        interface
    - netkit:
      - add ability to configure head/tailroom
    - VXLAN:
      - accepts packets with user-defined reserved bit
 
  - Ethernet switches:
    - Microchip:
      - lan969x: add RGMII support
      - lan969x: improve TX and RX performance using the FDMA engine
    - nVidia/Mellanox:
      - move Tx header handling to PCI driver, to ease XDP support
 
  - Ethernet PHYs:
    - Texas Instruments DP83822:
      - add support for GPIO2 clock output
    - Realtek:
      - 8169: add support for RTL8125D rev.b
      - rtl822x: add hwmon support for the temperature sensor
    - Microchip:
      - add support for RDS PTP hardware
      - consolidate periodic output signal generation
 
  - CAN:
    - several DT-bindings to DT schema conversions
    - tcan4x5x:
      - add HW standby support
      - support nWKRQ voltage selection
    - kvaser:
      - allowing Bus Error Reporting runtime configuration
 
  - WiFi:
    - the on-going Multi-Link Operation (MLO) effort continues, affecting
      both the stack and in drivers
    - mac80211/cfg80211:
      - Emergency Preparedness Communication Services (EPCS) station mode
        support
      - support for adding and removing station links for MLO
      - add support for WiFi 7/EHT mesh over 320 MHz channels
      - report Tx power info for each link
    - RealTek (rtw88):
      - enable USB Rx aggregation and USB 3 to improve performance
      - LED support
    - RealTek (rtw89):
      - refactor power save to support Multi-Link Operations
      - add support for RTL8922AE-VS variant
    - MediaTek (mt76):
      - single wiphy multiband support (preparation for MLO)
      - p2p device support
      - add TP-Link TXE50UH USB adapter support
    - Qualcomm (ath10k):
      - support for the QCA6698AQ IP core
    - Qualcomm (ath12k):
      - enable MLO for QCN9274
 
  - Bluetooth:
    - Allow sysfs to trigger hdev reset, to allow recovering devices
      not responsive from user-space
    - MediaTek: add support for MT7922, MT7925, MT7921e devices
    - Realtek: add support for RTL8851BE devices
    - Qualcomm: add support for WCN785x devices
    - ISO: allow BIG re-sync
 
 Signed-off-by: Paolo Abeni <pabeni@redhat.com>
 -----BEGIN PGP SIGNATURE-----
 
 iQJGBAABCAAwFiEEg1AjqC77wbdLX2LbKSR5jcyPE6QFAmePf5YSHHBhYmVuaUBy
 ZWRoYXQuY29tAAoJECkkeY3MjxOkUcMQALblhkGTxurnfT+yK+Bsuhn2LoHl2RPN
 4u2Kjkzm+2FYgcw6lS17cFXsnfAPlRIpmhnmKk1EBgsBdkuL29c+jtqnljA2bboD
 tIMhMgWiaLS3xgEMrLeKnseIo0G9mviQRphGeZPFTaLb4Ww/bd5LAp4ZGc5oij76
 tURatC3b6MuO4Lt5U+jWKnRwviXku8udHkVHXlvPdirawHCVinmx3tvce/BI/MaD
 eUOp6ZeJCPCOLtk7b8WEyxxvdY0f6D9ed82qfPDHjb94SJv+Vxb38RZtNuApIjn9
 S0KdlNih/4flDy17LDxGYSyFps78lUFRbpqmsUlnZkyLXpsph7/WTvAmMAFcrX0K
 UgQ/F/q5GAvcP5WZcCj5+tZaRmfKQraQirXMtYU/Uj50qCnSU7ssyACASt23GLZ8
 OF8tCLlm9lLOU1B6Ofkul1Dbo5f0Xpaghga4dFb0kzSfbm78fTUnqBNsJ7jIkWfi
 fD6dO+fg+p2ZMD0CACGo3CNxQuJmaQWg6BIDeno6God8kZ6qBMxY/sFr4qozrvFH
 x/FgQq8dgc8WLmaPejKiNIPkdQepXrIiv3T9jgMVyEjJnWB/LBfyWKSQOdTfnLs+
 rgr4YMV6XW4bx0fYqTI8B9jZ+FCWbG6sn4UtRTHITKcd3FSvd8Y+PHa5YyCUWvJM
 l8pePMGF0XVF
 =hrsp
 -----END PGP SIGNATURE-----

Merge tag 'net-next-6.14' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next

Pull networking updates from Paolo Abeni:
 "This is slightly smaller than usual, with the most interesting work
  being still around RTNL scope reduction.

  Core:

   - More core refactoring to reduce the RTNL lock contention, including
     preparatory work for the per-network namespace RTNL lock, replacing
     RTNL lock with a per device-one to protect NAPI-related net device
     data and moving synchronize_net() calls outside such lock.

   - Extend drop reasons usage, adding net scheduler, AF_UNIX, bridge
     and more specific TCP coverage.

   - Reduce network namespace tear-down time by removing per-subsystems
     synchronize_net() in tipc and sched.

   - Add flow label selector support for fib rules, allowing traffic
     redirection based on such header field.

  Netfilter:

   - Do not remove netdev basechain when last device is gone, allowing
     netdev basechains without devices.

   - Revisit the flowtable teardown strategy, dealing better with fin,
     reset and re-open events.

   - Scale-up IP-vs connection dumping by avoiding linear search on each
     restart.

  Protocols:

   - A significant XDP socket refactor, consolidating and optimizing
     several helpers into the core

   - Better scaling of ICMP rate-limiting, by removing false-sharing in
     inet peers handling.

   - Introduces netlink notifications for multicast IPv4 and IPv6
     address changes.

   - Add ipsec support for IP-TFS/AggFrag encapsulation, allowing
     aggregation and fragmentation of the inner IP.

   - Add sysctl to configure TIME-WAIT reuse delay for TCP sockets, to
     avoid local port exhaustion issues when the average connection
     lifetime is very short.

   - Support updating keys (re-keying) for connections using kernel TLS
     (for TLS 1.3 only).

   - Support ipv4-mapped ipv6 address clients in smc-r v2.

   - Add support for jumbo data packet transmission in RxRPC sockets,
     gluing multiple data packets in a single UDP packet.

   - Support RxRPC RACK-TLP to manage packet loss and retransmission in
     conjunction with the congestion control algorithm.

  Driver API:

   - Introduce a unified and structured interface for reporting PHY
     statistics, exposing consistent data across different H/W via
     ethtool.

   - Make timestamping selectable, allow the user to select the desired
     hwtstamp provider (PHY or MAC) administratively.

   - Add support for configuring a header-data-split threshold (HDS)
     value via ethtool, to deal with partial or buggy H/W
     implementation.

   - Consolidate DSA drivers Energy Efficiency Ethernet support.

   - Add EEE management to phylink, making use of the phylib
     implementation.

   - Add phylib support for in-band capabilities negotiation.

   - Simplify how phylib-enabled mac drivers expose the supported
     interfaces.

  Tests and tooling:

   - Make the YNL tool package-friendly to make it easier to deploy it
     separately from the kernel.

   - Increase TCP selftest coverage importing several packetdrill
     test-cases.

   - Regenerate the ethtool uapi header from the YNL spec, to ease
     maintenance and future development.

   - Add YNL support for decoding the link types used in net self-tests,
     allowing a single build to run both net and drivers/net.

  Drivers:

   - Ethernet high-speed NICs:
      - nVidia/Mellanox (mlx5):
         - add cross E-Switch QoS support
         - add SW Steering support for ConnectX-8
         - implement support for HW-Managed Flow Steering, improving the
           rule deletion/insertion rate
         - support for multi-host LAG
      - Intel (ixgbe, ice, igb):
         - ice: add support for devlink health events
         - ixgbe: add initial support for E610 chipset variant
         - igb: add support for AF_XDP zero-copy
      - Meta:
         - add support for basic RSS config
         - allow changing the number of channels
         - add hardware monitoring support
      - Broadcom (bnxt):
         - implement TCP data split and HDS threshold ethtool support,
           enabling Device Memory TCP.
      - Marvell Octeon:
         - implement egress ipsec offload support for the cn10k family
      - Hisilicon (HIBMC):
         - implement unicast MAC filtering

   - Ethernet NICs embedded and virtual:
      - Convert UDP tunnel drivers to NETDEV_PCPU_STAT_DSTATS, avoiding
        contented atomic operations for drop counters
      - Freescale:
         - quicc: phylink conversion
         - enetc: support Tx and Rx checksum offload and improve TSO
           performances
      - MediaTek:
         - airoha: introduce support for ETS and HTB Qdisc offload
      - Microchip:
         - lan78XX USB: preparation work for phylink conversion
      - Synopsys (stmmac):
         - support DWMAC IP on NXP Automotive SoCs S32G2xx/S32G3xx/S32R45
         - refactor EEE support to leverage the new driver API
         - optimize DMA and cache access to increase raw RX performances
           by 40%
      - TI:
         - icssg-prueth: add multicast filtering support for VLAN
           interface
      - netkit:
         - add ability to configure head/tailroom
      - VXLAN:
         - accepts packets with user-defined reserved bit

   - Ethernet switches:
      - Microchip:
         - lan969x: add RGMII support
         - lan969x: improve TX and RX performance using the FDMA engine
      - nVidia/Mellanox:
         - move Tx header handling to PCI driver, to ease XDP support

   - Ethernet PHYs:
      - Texas Instruments DP83822:
         - add support for GPIO2 clock output
      - Realtek:
         - 8169: add support for RTL8125D rev.b
         - rtl822x: add hwmon support for the temperature sensor
      - Microchip:
         - add support for RDS PTP hardware
         - consolidate periodic output signal generation

   - CAN:
      - several DT-bindings to DT schema conversions
      - tcan4x5x:
         - add HW standby support
         - support nWKRQ voltage selection
      - kvaser:
         - allowing Bus Error Reporting runtime configuration

   - WiFi:
      - the on-going Multi-Link Operation (MLO) effort continues,
        affecting both the stack and in drivers
      - mac80211/cfg80211:
         - Emergency Preparedness Communication Services (EPCS) station
           mode support
         - support for adding and removing station links for MLO
         - add support for WiFi 7/EHT mesh over 320 MHz channels
         - report Tx power info for each link
      - RealTek (rtw88):
         - enable USB Rx aggregation and USB 3 to improve performance
         - LED support
      - RealTek (rtw89):
         - refactor power save to support Multi-Link Operations
         - add support for RTL8922AE-VS variant
      - MediaTek (mt76):
         - single wiphy multiband support (preparation for MLO)
         - p2p device support
         - add TP-Link TXE50UH USB adapter support
      - Qualcomm (ath10k):
         - support for the QCA6698AQ IP core
      - Qualcomm (ath12k):
         - enable MLO for QCN9274

   - Bluetooth:
      - Allow sysfs to trigger hdev reset, to allow recovering devices
        not responsive from user-space
      - MediaTek: add support for MT7922, MT7925, MT7921e devices
      - Realtek: add support for RTL8851BE devices
      - Qualcomm: add support for WCN785x devices
      - ISO: allow BIG re-sync"

* tag 'net-next-6.14' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next: (1386 commits)
  net/rose: prevent integer overflows in rose_setsockopt()
  net: phylink: fix regression when binding a PHY
  net: ethernet: ti: am65-cpsw: streamline TX queue creation and cleanup
  net: ethernet: ti: am65-cpsw: streamline RX queue creation and cleanup
  net: ethernet: ti: am65-cpsw: ensure proper channel cleanup in error path
  ipv6: Convert inet6_rtm_deladdr() to per-netns RTNL.
  ipv6: Convert inet6_rtm_newaddr() to per-netns RTNL.
  ipv6: Move lifetime validation to inet6_rtm_newaddr().
  ipv6: Set cfg.ifa_flags before device lookup in inet6_rtm_newaddr().
  ipv6: Pass dev to inet6_addr_add().
  ipv6: Convert inet6_ioctl() to per-netns RTNL.
  ipv6: Hold rtnl_net_lock() in addrconf_init() and addrconf_cleanup().
  ipv6: Hold rtnl_net_lock() in addrconf_dad_work().
  ipv6: Hold rtnl_net_lock() in addrconf_verify_work().
  ipv6: Convert net.ipv6.conf.${DEV}.XXX sysctl to per-netns RTNL.
  ipv6: Add __in6_dev_get_rtnl_net().
  net: stmmac: Drop redundant skb_mark_for_recycle() for SKB frags
  net: mii: Fix the Speed display when the network cable is not connected
  sysctl net: Remove macro checks for CONFIG_SYSCTL
  eth: bnxt: update header sizing defaults
  ...
2025-01-22 08:28:57 -08:00
Linus Torvalds 1d6d399223 Kthreads affinity follow either of 4 existing different patterns:
1) Per-CPU kthreads must stay affine to a single CPU and never execute
    relevant code on any other CPU. This is currently handled by smpboot
    code which takes care of CPU-hotplug operations. Affinity here is
    a correctness constraint.
 
 2) Some kthreads _have_ to be affine to a specific set of CPUs and can't
    run anywhere else. The affinity is set through kthread_bind_mask()
    and the subsystem takes care by itself to handle CPU-hotplug
    operations. Affinity here is assumed to be a correctness constraint.
 
 3) Per-node kthreads _prefer_ to be affine to a specific NUMA node. This
    is not a correctness constraint but merely a preference in terms of
    memory locality. kswapd and kcompactd both fall into this category.
    The affinity is set manually like for any other task and CPU-hotplug
    is supposed to be handled by the relevant subsystem so that the task
    is properly reaffined whenever a given CPU from the node comes up.
    Also care should be taken so that the node affinity doesn't cross
    isolated (nohz_full) cpumask boundaries.
 
 4) Similar to the previous point except kthreads have a _preferred_
    affinity different than a node. Both RCU boost kthreads and RCU
    exp kworkers fall into this category as they refer to "RCU nodes"
    from a distinctly distributed tree.
 
 Currently the preferred affinity patterns (3 and 4) have at least 4
 identified users, with more or less success when it comes to handle
 CPU-hotplug operations and CPU isolation. Each of which do it in its own
 ad-hoc way.
 
 This is an infrastructure proposal to handle this with the following API
 changes:
 
 _ kthread_create_on_node() automatically affines the created kthread to
   its target node unless it has been set as per-cpu or bound with
   kthread_bind[_mask]() before the first wake-up.
 
 - kthread_affine_preferred() is a new function that can be called right
   after kthread_create_on_node() to specify a preferred affinity
   different than the specified node.
 
 When the preferred affinity can't be applied because the possible
 targets are offline or isolated (nohz_full), the kthread is affine
 to the housekeeping CPUs (which means to all online CPUs most of the
 time or only the non-nohz_full CPUs when nohz_full= is set).
 
 kswapd, kcompactd, RCU boost kthreads and RCU exp kworkers have been
 converted, along with a few old drivers.
 
 Summary of the changes:
 
 * Consolidate a bunch of ad-hoc implementations of kthread_run_on_cpu()
 
 * Introduce task_cpu_fallback_mask() that defines the default last
   resort affinity of a task to become nohz_full aware
 
 * Add some correctness check to ensure kthread_bind() is always called
   before the first kthread wake up.
 
 * Default affine kthread to its preferred node.
 
 * Convert kswapd / kcompactd and remove their halfway working ad-hoc
   affinity implementation
 
 * Implement kthreads preferred affinity
 
 * Unify kthread worker and kthread API's style
 
 * Convert RCU kthreads to the new API and remove the ad-hoc affinity
   implementation.
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCAAdFiEEd76+gtGM8MbftQlOhSRUR1COjHcFAmeNf8gACgkQhSRUR1CO
 jHedQQ/+IxTjjqQiItzrq41TES2S0desHDq8lNJFb7rsR/DtKFyLx3s67cOYV+cM
 Yx54QHg2m/Fz4nXMQ7Po5ygOtJGCKBc5C5QQy7y0lVKeTQK+daDfEtBSa3oG7j3C
 u+E3tTY6qxkbCzymUyaKkHN4/ay2vLvjFS50luV7KMyI3x47Aji+t7VdCX4LCPP2
 eAwOALWD0+7qLJ/VF6gsmQLKA4Qx7PQAzBa3KSBmUN9UcN8Gk1bQHCTIQKDHP9LQ
 v8BXrNZtYX1o2+snNYpX2z6/ECjxkdwriOgqqZY5306hd9RAQ1u46Dx3byrIqjGn
 ULG/XQ2istPyhTqb/h+RbrobdOcwEUIeqk8hRRbBXE8bPpqUz9EMuaCMxWDbQjgH
 NTuKG4ifKJ/IqstkkuDkdOiByE/ysMmwqrTXgSnu2ITNL9yY3BEgFbvA95hgo42s
 f7QCxEfZb1MHcNEMENSMwM3xw5lLMGMpxVZcMQ3gLwyotMBRrhFZm1qZJG7TITYW
 IDIeCbH4JOMdQwLs3CcWTXio0N5/85NhRNFV+IDn96OrgxObgnMtV8QwNgjXBAJ5
 wGeJWt8s34W1Zo3qS9gEuVzEhW4XaxISQQMkHe8faKkK6iHmIB/VjSQikDwwUNQ/
 AspYj82RyWBCDZsqhiYh71kpxjvS6Xp0bj39Ce1sNsOnuksxKkQ=
 =g8In
 -----END PGP SIGNATURE-----

Merge tag 'kthread-for-6.14-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/frederic/linux-dynticks

Pull kthread updates from Frederic Weisbecker:
 "Kthreads affinity follow either of 4 existing different patterns:

   1) Per-CPU kthreads must stay affine to a single CPU and never
      execute relevant code on any other CPU. This is currently handled
      by smpboot code which takes care of CPU-hotplug operations.
      Affinity here is a correctness constraint.

   2) Some kthreads _have_ to be affine to a specific set of CPUs and
      can't run anywhere else. The affinity is set through
      kthread_bind_mask() and the subsystem takes care by itself to
      handle CPU-hotplug operations. Affinity here is assumed to be a
      correctness constraint.

   3) Per-node kthreads _prefer_ to be affine to a specific NUMA node.
      This is not a correctness constraint but merely a preference in
      terms of memory locality. kswapd and kcompactd both fall into this
      category. The affinity is set manually like for any other task and
      CPU-hotplug is supposed to be handled by the relevant subsystem so
      that the task is properly reaffined whenever a given CPU from the
      node comes up. Also care should be taken so that the node affinity
      doesn't cross isolated (nohz_full) cpumask boundaries.

   4) Similar to the previous point except kthreads have a _preferred_
      affinity different than a node. Both RCU boost kthreads and RCU
      exp kworkers fall into this category as they refer to "RCU nodes"
      from a distinctly distributed tree.

  Currently the preferred affinity patterns (3 and 4) have at least 4
  identified users, with more or less success when it comes to handle
  CPU-hotplug operations and CPU isolation. Each of which do it in its
  own ad-hoc way.

  This is an infrastructure proposal to handle this with the following
  API changes:

   - kthread_create_on_node() automatically affines the created kthread
     to its target node unless it has been set as per-cpu or bound with
     kthread_bind[_mask]() before the first wake-up.

   - kthread_affine_preferred() is a new function that can be called
     right after kthread_create_on_node() to specify a preferred
     affinity different than the specified node.

  When the preferred affinity can't be applied because the possible
  targets are offline or isolated (nohz_full), the kthread is affine to
  the housekeeping CPUs (which means to all online CPUs most of the time
  or only the non-nohz_full CPUs when nohz_full= is set).

  kswapd, kcompactd, RCU boost kthreads and RCU exp kworkers have been
  converted, along with a few old drivers.

  Summary of the changes:

   - Consolidate a bunch of ad-hoc implementations of
     kthread_run_on_cpu()

   - Introduce task_cpu_fallback_mask() that defines the default last
     resort affinity of a task to become nohz_full aware

   - Add some correctness check to ensure kthread_bind() is always
     called before the first kthread wake up.

   - Default affine kthread to its preferred node.

   - Convert kswapd / kcompactd and remove their halfway working ad-hoc
     affinity implementation

   - Implement kthreads preferred affinity

   - Unify kthread worker and kthread API's style

   - Convert RCU kthreads to the new API and remove the ad-hoc affinity
     implementation"

* tag 'kthread-for-6.14-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/frederic/linux-dynticks:
  kthread: modify kernel-doc function name to match code
  rcu: Use kthread preferred affinity for RCU exp kworkers
  treewide: Introduce kthread_run_worker[_on_cpu]()
  kthread: Unify kthread_create_on_cpu() and kthread_create_worker_on_cpu() automatic format
  rcu: Use kthread preferred affinity for RCU boost
  kthread: Implement preferred affinity
  mm: Create/affine kswapd to its preferred node
  mm: Create/affine kcompactd to its preferred node
  kthread: Default affine kthread to its preferred NUMA node
  kthread: Make sure kthread hasn't started while binding it
  sched,arm64: Handle CPU isolation on last resort fallback rq selection
  arm64: Exclude nohz_full CPUs from 32bits el0 support
  lib: test_objpool: Use kthread_run_on_cpu()
  kallsyms: Use kthread_run_on_cpu()
  soc/qman: test: Use kthread_run_on_cpu()
  arm/bL_switcher: Use kthread_run_on_cpu()
2025-01-21 17:10:05 -08:00
Jakub Kicinski ba0209bd18 Merge branch '100GbE' of git://git.kernel.org/pub/scm/linux/kernel/git/tnguy/next-queue
Tony Nguyen says:

====================
ice: support FW Recovery Mode

Konrad Knitter says:

Enable update of card in FW Recovery Mode

* '100GbE' of git://git.kernel.org/pub/scm/linux/kernel/git/tnguy/next-queue:
  ice: support FW Recovery Mode
  devlink: add devl guard
  pldmfw: enable selected component update
====================

Link: https://patch.msgid.link/20250116212059.1254349-1-anthony.l.nguyen@intel.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-01-17 19:45:35 -08:00
Konrad Knitter 1de25c6b98 ice: support FW Recovery Mode
Recovery Mode is intended to recover from a fatal failure scenario in
which the device is not accessible to the host, meaning the firmware is
non-responsive.

The purpose of the Firmware Recovery Mode is to enable software tools to
update firmware and/or device configuration so the fatal error can be
resolved.

Recovery Mode Firmware supports a limited set of admin commands required
for NVM update.
Recovery Firmware does not support hardware interrupts so a polling mode
is used.

The driver will expose only the minimum set of devlink commands required
for the recovery of the adapter.

Using an appropriate NVM image, the user can recover the adapter using
the devlink flash API.

Prior to 4.20 E810 Adapter Recovery Firmware supports only the update
and erase of the "fw.mgmt" component.

E810 Adapter Recovery Firmware doesn't support selected preservation of
cards settings or identifiers.

The following command can be used to recover the adapter:

$ devlink dev flash <pci-address> <update-image.bin> component fw.mgmt
  overwrite settings overwrite identifier

Newer FW versions (4.20 or newer) supports update of "fw.undi" and
"fw.netlist" components.

$ devlink dev flash <pci-address> <update-image.bin>

Tested on Intel Corporation Ethernet Controller E810-C for SFP
FW revision 3.20 and 4.30.

Reviewed-by: Przemek Kitszel <przemyslaw.kitszel@intel.com>
Signed-off-by: Konrad Knitter <konrad.knitter@intel.com>
Tested-by: Pucha Himasekhar Reddy <himasekharx.reddy.pucha@intel.com> (A Contingent worker at Intel)
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
2025-01-16 13:05:06 -08:00
Jakub Kicinski 2ee738e90e Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Cross-merge networking fixes after downstream PR (net-6.13-rc8).

Conflicts:

drivers/net/ethernet/realtek/r8169_main.c
  1f691a1fc4 ("r8169: remove redundant hwmon support")
  152d00a913 ("r8169: simplify setting hwmon attribute visibility")
https://lore.kernel.org/20250115122152.760b4e8d@canb.auug.org.au

Adjacent changes:

drivers/net/ethernet/broadcom/bnxt/bnxt.c
  152f4da05a ("bnxt_en: add support for rx-copybreak ethtool command")
  f0aa6a37a3 ("eth: bnxt: always recalculate features after XDP clearing, fix null-deref")

drivers/net/ethernet/intel/ice/ice_type.h
  50327223a8 ("ice: add lock to protect low latency interface")
  dc26548d72 ("ice: Fix quad registers read on E825")

Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-01-16 10:34:59 -08:00
Jakub Kicinski 413f0271f3 net: protect NAPI enablement with netdev_lock()
Wrap napi_enable() / napi_disable() with netdev_lock().
Provide the "already locked" flavor of the API.

iavf needs the usual adjustment. A number of drivers call
napi_enable() under a spin lock, so they have to be modified
to take netdev_lock() first, then spin lock then call
napi_enable_locked().

Protecting napi_enable() implies that napi->napi_id is protected
by netdev_lock().

Acked-by: Francois Romieu <romieu@fr.zoreil.com> # via-velocity
Reviewed-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Link: https://patch.msgid.link/20250115035319.559603-7-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-01-15 19:13:34 -08:00
Jakub Kicinski 1b23cdbd2b net: protect netdev->napi_list with netdev_lock()
Hold netdev->lock when NAPIs are getting added or removed.
This will allow safe access to NAPI instances of a net_device
without rtnl_lock.

Create a family of helpers which assume the lock is already taken.
Switch iavf to them, as it makes extensive use of netdev->lock,
already.

Reviewed-by: Joe Damato <jdamato@fastly.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Link: https://patch.msgid.link/20250115035319.559603-6-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-01-15 19:13:34 -08:00
Jakub Kicinski ebda2f0bbd net: add netdev_lock() / netdev_unlock() helpers
Add helpers for locking the netdev instance, use it in drivers
and the shaper code. This will make grepping for the lock usage
much easier, as we extend the lock to cover more fields.

Reviewed-by: Joe Damato <jdamato@fastly.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Reviewed-by: Przemek Kitszel <przemyslaw.kitszel@intel.com>
Link: https://patch.msgid.link/20250115035319.559603-2-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-01-15 19:13:33 -08:00
Karol Kolacinski 914639464b ice: Add in/out PTP pin delays
HW can have different input/output delays for each of the pins.

Currently, only E82X adapters have delay compensation based on TSPLL
config and E810 adapters have constant 1 ms compensation, both cases
only for output delays and the same one for all pins.

E825 adapters have different delays for SDP and other pins. Those
delays are also based on direction and input delays are different than
output ones. This is the main reason for moving delays to pin
description structure.

Add a field in ice_ptp_pin_desc structure to reflect that. Delay values
are based on approximate calculations of HW delays based on HW spec.

Implement external timestamp (input) delay compensation.

Remove existing definitions and wrappers for periodic output propagation
delays.

Reviewed-by: Przemek Kitszel <przemyslaw.kitszel@intel.com>
Signed-off-by: Karol Kolacinski <karol.kolacinski@intel.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Tested-by: Sunitha Mekala <sunithax.d.mekala@intel.com> (A Contingent worker at Intel)
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
2025-01-14 14:37:34 -08:00
Jacob Keller ef9a64c072 ice: implement low latency PHY timer updates
Programming the PHY registers in preparation for an increment value change
or a timer adjustment on E810 requires issuing Admin Queue commands for
each PHY register. It has been found that the firmware Admin Queue
processing occasionally has delays of tens or rarely up to hundreds of
milliseconds. This delay cascades to failures in the PTP applications which
depend on these updates being low latency.

Consider a standard PTP profile with a sync rate of 16 times per second.
This means there is ~62 milliseconds between sync messages. A complete
cycle of the PTP algorithm

1) Sync message (with Tx timestamp) from source
2) Follow-up message from source
3) Delay request (with Tx timestamp) from sink
4) Delay response (with Rx timestamp of request) from source
5) measure instantaneous clock offset
6) request time adjustment via CLOCK_ADJTIME systemcall

The Tx timestamps have a default maximum timeout of 10 milliseconds. If we
assume that the maximum possible time is used, this leaves us with ~42
milliseconds of processing time for a complete cycle.

The CLOCK_ADJTIME system call is synchronous and will block until the
driver completes its timer adjustment or frequency change.

If the writes to prepare the PHY timers get hit by a latency spike of 50
milliseconds, then the PTP application will be delayed past the point where
the next cycle should start. Packets from the next cycle may have already
arrived and are waiting on the socket.

In particular, LinuxPTP ptp4l may start complaining about missing an
announce message from the source, triggering a fault. In addition, the
clockcheck logic it uses may trigger. This clockcheck failure occurs
because the timestamp captured by hardware is compared against a reading of
CLOCK_MONOTONIC. It is assumed that the time when the Rx timestamp is
captured and the read from CLOCK_MONOTONIC are relatively close together.
This is not the case if there is a significant delay to processing the Rx
packet.

Newer firmware supports programming the PHY registers over a low latency
interface which bypasses the Admin Queue. Instead, software writes to the
REG_LL_PROXY_L and REG_LL_PROXY_H registers. Firmware reads these registers
and then programs the PHY timers.

Implement functions to use this interface when available to program the PHY
timers instead of using the Admin Queue. This avoids the Admin Queue
latency and ensures that adjustments happen within acceptable latency
bounds.

Co-developed-by: Karol Kolacinski <karol.kolacinski@intel.com>
Signed-off-by: Karol Kolacinski <karol.kolacinski@intel.com>
Signed-off-by: Jacob Keller <jacob.e.keller@intel.com>
Signed-off-by: Anton Nadezhdin <anton.nadezhdin@intel.com>
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
2025-01-14 14:37:34 -08:00
Jacob Keller a5c69d45df ice: check low latency PHY timer update firmware capability
Newer versions of firmware support programming the PHY timer via the low
latency interface exposed over REG_LL_PROXY_L and REG_LL_PROXY_H. Add
support for checking the device capabilities for this feature.

Co-developed-by: Karol Kolacinski <karol.kolacinski@intel.com>
Signed-off-by: Karol Kolacinski <karol.kolacinski@intel.com>
Signed-off-by: Jacob Keller <jacob.e.keller@intel.com>
Reviewed-by: Milena Olech <milena.olech@intel.com>
Signed-off-by: Anton Nadezhdin <anton.nadezhdin@intel.com>
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
2025-01-14 14:37:34 -08:00
Jacob Keller 50327223a8 ice: add lock to protect low latency interface
Newer firmware for the E810 devices support a 'low latency' interface to
interact with the PHY without using the Admin Queue. This is interacted
with via the REG_LL_PROXY_L and REG_LL_PROXY_H registers.

Currently, this interface is only used for Tx timestamps. There are two
different mechanisms, including one which uses an interrupt for firmware to
signal completion. However, these two methods are mutually exclusive, so no
synchronization between them was necessary.

This low latency interface is being extended in future firmware to support
also programming the PHY timers. Use of the interface for PHY timers will
need synchronization to ensure there is no overlap with a Tx timestamp.

The interrupt-based response complicates the locking somewhat. We can't use
a simple spinlock. This would require being acquired in
ice_ptp_req_tx_single_tstamp, and released in
ice_ptp_complete_tx_single_tstamp. The ice_ptp_req_tx_single_tstamp
function is called from the threaded IRQ, and the
ice_ptp_complete_tx_single_stamp is called from the low latency IRQ, so we
would need to acquire the lock with IRQs disabled.

To handle this, we'll use a wait queue along with
wait_event_interruptible_locked_irq in the update flows which don't use the
interrupt.

The interrupt flow will acquire the wait queue lock, set the
ATQBAL_FLAGS_INTR_IN_PROGRESS, and then initiate the firmware low latency
request, and unlock the wait queue lock.

Upon receipt of the low latency interrupt, the lock will be acquired, the
ATQBAL_FLAGS_INTR_IN_PROGRESS bit will be cleared, and the firmware
response will be captured, and wake_up_locked() will be called on the wait
queue.

The other flows will use wait_event_interruptible_locked_irq() to wait
until the ATQBAL_FLAGS_INTR_IN_PROGRESS is clear. This function checks the
condition under lock, but does not hold the lock while waiting. On return,
the lock is held, and a return of zero indicates we hold the lock and the
in-progress flag is not set.

This will ensure that threads which need to use the low latency interface
will sleep until they can acquire the lock without any pending low latency
interrupt flow interfering.

Signed-off-by: Jacob Keller <jacob.e.keller@intel.com>
Reviewed-by: Milena Olech <milena.olech@intel.com>
Signed-off-by: Anton Nadezhdin <anton.nadezhdin@intel.com>
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
2025-01-14 14:37:34 -08:00
Jacob Keller 5b15b1f144 ice: rename TS_LL_READ* macros to REG_LL_PROXY_H_*
The TS_LL_READ macros are used as part of the low latency Tx timestamp
interface. A future firmware extension will add support for performing PHY
timer updates over this interface. Using TS_LL_READ as the prefix for these
macros will be confusing once the interface is used for other purposes.

Rename the macros, using the prefix REG_LL_PROXY_H, to better clarify that
this is for the low latency interface.
Additionally add macros for PF_SB_ATQBAH and PF_SB_ATQBAL registers to
better clarify content of this registers as PF_SB_ATQBAH contain low
part of Tx timestamp

Co-developed-by: Karol Kolacinski <karol.kolacinski@intel.com>
Signed-off-by: Karol Kolacinski <karol.kolacinski@intel.com>
Signed-off-by: Jacob Keller <jacob.e.keller@intel.com>
Reviewed-by: Milena Olech <milena.olech@intel.com>
Signed-off-by: Anton Nadezhdin <anton.nadezhdin@intel.com>
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
2025-01-14 14:37:34 -08:00
Jacob Keller 95aca43b4a ice: use read_poll_timeout_atomic in ice_read_phy_tstamp_ll_e810
The ice_read_phy_tstamp_ll_e810 function repeatedly reads the PF_SB_ATQBAL
register until the TS_LL_READ_TS bit is cleared. This is a perfect
candidate for using rd32_poll_timeout. However, the default implementation
uses a sleep-based wait. Use read_poll_timeout_atomic macro which is based
on the non-sleeping implementation and use it to replace the loop reading
in the ice_read_phy_tstamp_ll_e810 function.

Co-developed-by: Karol Kolacinski <karol.kolacinski@intel.com>
Signed-off-by: Karol Kolacinski <karol.kolacinski@intel.com>
Signed-off-by: Jacob Keller <jacob.e.keller@intel.com>
Signed-off-by: Anton Nadezhdin <anton.nadezhdin@intel.com>
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
2025-01-14 14:37:34 -08:00
R Sundar 4c9f13a654 ice: use string choice helpers
Use string choice helpers for better readability.

Reported-by: kernel test robot <lkp@intel.com>
Reported-by: Julia Lawall <julia.lawall@inria.fr>
Closes: https://lore.kernel.org/r/202410121553.SRNFzc2M-lkp@intel.com/
Signed-off-by: R Sundar <prosunofficial@gmail.com>
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
2025-01-14 14:37:34 -08:00
Konrad Knitter 85d6164ec5 ice: add fw and port health reporters
Firmware generates events for global events or port specific events.

Driver shall subscribe for health status events from firmware on supported
FW versions >= 1.7.6.
Driver shall expose those under specific health reporter, two new
reporters are introduced:
- FW health reporter shall represent global events (problems with the
image, recovery mode);
- Port health reporter shall represent port-specific events (module
failure).

Firmware only reports problems when those are detected, it does not store
active fault list.
Driver will hold only last global and last port-specific event.
Driver will report all events via devlink health report,
so in case of multiple events of the same source they can be reviewed
using devlink autodump feature.

$ devlink health

pci/0000:b1:00.3:
  reporter fw
    state healthy error 0 recover 0 auto_dump true
  reporter port
    state error error 1 recover 0 last_dump_date 2024-03-17
	last_dump_time 09:29:29 auto_dump true

$ devlink health diagnose pci/0000:b1:00.3 reporter port

  Syndrome: 262
  Description: Module is not present.
  Possible Solution: Check that the module is inserted correctly.
  Port Number: 0

Tested on Intel Corporation Ethernet Controller E810-C for SFP

Reviewed-by: Marcin Szycik <marcin.szycik@linux.intel.com>
Co-developed-by: Sharon Haroni <sharon.haroni@intel.com>
Signed-off-by: Sharon Haroni <sharon.haroni@intel.com>
Co-developed-by: Nicholas Nunley <nicholas.d.nunley@intel.com>
Signed-off-by: Nicholas Nunley <nicholas.d.nunley@intel.com>
Co-developed-by: Brett Creeley <brett.creeley@intel.com>
Signed-off-by: Brett Creeley <brett.creeley@intel.com>
Signed-off-by: Konrad Knitter <konrad.knitter@intel.com>
Tested-by: Rinitha S <sx.rinitha@intel.com> (A Contingent worker at Intel)
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
2025-01-14 14:37:34 -08:00
Michal Swiatkowski e81e1d79a9 ice: add recipe priority check in search
The new recipe should be added even if exactly the same recipe already
exists with different priority.

Example use case is when the rule is being added from TC tool context.
It should has the highest priority, but if the recipe already exists
the rule will inherit it priority. It can lead to the situation when
the rule added from TC tool has lower priority than expected.

The solution is to check the recipe priority when trying to find
existing one.

Previous recipe is still useful. Example:
RID 8 -> priority 4
RID 10 -> priority 7

The difference is only in priority rest is let's say eth + mac +
direction.

Adding ARP + MAC_A + RX on RID 8, forward to VF0_VSI
After that IP + MAC_B + RX on RID 10 (from TC tool), forward to PF0

Both will work.

In case of adding ARP + MAC_A + RX on RID 8, forward to VF0_VSI
ARP + MAC_A + RX on RID 10, forward to PF0.

Only second one will match, but this is expected.

Reviewed-by: Marcin Szycik <marcin.szycik@linux.intel.com>
Reviewed-by: Przemek Kitszel <przemyslaw.kitszel@intel.com>
Signed-off-by: Michal Swiatkowski <michal.swiatkowski@linux.intel.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Tested-by: Sujai Buvaneswaran <sujai.buvaneswaran@intel.com>
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
2025-01-14 14:37:34 -08:00
Przemek Kitszel fb59a520bb ice: ice_probe: init ice_adapter after HW init
Move ice_adapter initialization to be after HW init, so it could use HW
capabilities, like number of PFs. This is needed for devlink-resource
based RSS LUT size management for PF/VF (not in this series).

Reviewed-by: Marcin Szycik <marcin.szycik@linux.intel.com>
Signed-off-by: Przemek Kitszel <przemyslaw.kitszel@intel.com>
Reviewed-by: Kalesh AP <kalesh-anakkur.purayil@broadcom.com>
Tested-by: Pucha Himasekhar Reddy <himasekharx.reddy.pucha@intel.com> (A Contingent worker at Intel)
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
2025-01-14 14:37:34 -08:00
Przemek Kitszel 5d5d9c2c0f ice: minor: rename goto labels from err to unroll
Clean up goto labels after previous commit, to conform to single naming
scheme in ice_probe() and ice_init_dev().

Reviewed-by: Marcin Szycik <marcin.szycik@linux.intel.com>
Signed-off-by: Przemek Kitszel <przemyslaw.kitszel@intel.com>
Reviewed-by: Kalesh AP <kalesh-anakkur.purayil@broadcom.com>
Tested-by: Pucha Himasekhar Reddy <himasekharx.reddy.pucha@intel.com> (A Contingent worker at Intel)
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
2025-01-14 14:37:34 -08:00
Przemek Kitszel 4d3f59bfa2 ice: split ice_init_hw() out from ice_init_dev()
Split ice_init_hw() call out from ice_init_dev(). Such move enables
pulling the former to be even earlier on call path, what would enable
moving ice_adapter init to be between the two (in subsequent commit).
Such move enables ice_adapter to know about number of PFs.

Do the same for ice_deinit_hw(), so the init and deinit calls could
be easily mirrored.
Next commit will rename unrelated goto labels to unroll prefix.

Reviewed-by: Marcin Szycik <marcin.szycik@linux.intel.com>
Signed-off-by: Przemek Kitszel <przemyslaw.kitszel@intel.com>
Reviewed-by: Kalesh AP <kalesh-anakkur.purayil@broadcom.com>
Tested-by: Pucha Himasekhar Reddy <himasekharx.reddy.pucha@intel.com> (A Contingent worker at Intel)
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
2025-01-14 14:37:34 -08:00
Przemek Kitszel c37dd67c42 ice: c827: move wait for FW to ice_init_hw()
Move call to ice_wait_for_fw() from ice_init_dev() into ice_init_hw(),
where it fits better. This requires also to move ice_wait_for_fw()
to ice_common.c.

ice_is_pf_c827() is now used only in ice_common.c, so it could be static.

CC: Arkadiusz Kubalewski <arkadiusz.kubalewski@intel.com>
Reviewed-by: Marcin Szycik <marcin.szycik@linux.intel.com>
Signed-off-by: Przemek Kitszel <przemyslaw.kitszel@intel.com>
Tested-by: Pucha Himasekhar Reddy <himasekharx.reddy.pucha@intel.com> (A Contingent worker at Intel)
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
2025-01-14 14:37:33 -08:00
Jakub Kicinski afc664987a eth: iavf: extend the netdev_lock usage
iavf uses the netdev->lock already to protect shapers.
In an upcoming series we'll try to protect NAPI instances
with netdev->lock.

We need to modify the protection a bit. All NAPI related
calls in the driver need to be consistently under the lock.
This will allow us to easily switch to a "we already hold
the lock" NAPI API later.

register_netdevice(), OTOH, must not be called under
the netdev_lock() as we do not intend to have an
"already locked" version of this call.

Link: https://patch.msgid.link/20250111071339.3709071-1-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-01-13 19:08:30 -08:00
Karol Kolacinski 258f5f9058 ice: Add correct PHY lane assignment
Driver always naively assumes, that for PTP purposes, PHY lane to
configure is corresponding to PF ID.

This is not true for some port configurations, e.g.:
- 2x50G per quad, where lanes used are 0 and 2 on each quad, but PF IDs
  are 0 and 1
- 100G per quad on 2 quads, where lanes used are 0 and 4, but PF IDs are
  0 and 1

Use correct PHY lane assignment by getting and parsing port options.
This is read from the NVM by the FW and provided to the driver with
the indication of active port split.

Remove ice_is_muxed_topo(), which is no longer needed.

Fixes: 4409ea1726 ("ice: Adjust PTP init for 2x50G E825C devices")
Reviewed-by: Przemek Kitszel <przemyslaw.kitszel@intel.com>
Reviewed-by: Arkadiusz Kubalewski <Arkadiusz.kubalewski@intel.com>
Signed-off-by: Karol Kolacinski <karol.kolacinski@intel.com>
Signed-off-by: Grzegorz Nitka <grzegorz.nitka@intel.com>
Tested-by: Pucha Himasekhar Reddy <himasekharx.reddy.pucha@intel.com> (A Contingent worker at Intel)
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
2025-01-13 09:59:14 -08:00
Karol Kolacinski 2e60560f1e ice: Fix ETH56G FC-FEC Rx offset value
Fix ETH56G FC-FEC incorrect Rx offset value by changing it from -255.96
to -469.26 ns.

Those values are derived from HW spec and reflect internal delays.
Hex value is a fixed point representation in Q23.9 format.

Fixes: 7cab44f1c3 ("ice: Introduce ETH56G PHY model for E825C products")
Reviewed-by: Arkadiusz Kubalewski <arkadiusz.kubalewski@intel.com>
Signed-off-by: Karol Kolacinski <karol.kolacinski@intel.com>
Tested-by: Pucha Himasekhar Reddy <himasekharx.reddy.pucha@intel.com> (A Contingent worker at Intel)
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
2025-01-13 09:59:14 -08:00
Karol Kolacinski dc26548d72 ice: Fix quad registers read on E825
Quad registers are read/written incorrectly. E825 devices always use
quad 0 address and differentiate between the PHYs by changing SBQ
destination device (phy_0 or phy_0_peer).

Add helpers for reading/writing PTP registers shared per quad and use
correct quad address and SBQ destination device based on port.

Fixes: 7cab44f1c3 ("ice: Introduce ETH56G PHY model for E825C products")
Reviewed-by: Arkadiusz Kubalewski <arkadiusz.kubalewski@intel.com>
Signed-off-by: Karol Kolacinski <karol.kolacinski@intel.com>
Signed-off-by: Grzegorz Nitka <grzegorz.nitka@intel.com>
Tested-by: Pucha Himasekhar Reddy <himasekharx.reddy.pucha@intel.com> (A Contingent worker at Intel)
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
2025-01-13 09:59:14 -08:00
Karol Kolacinski d79c304c76 ice: Fix E825 initialization
Current implementation checks revision of all PHYs on all PFs, which is
incorrect and may result in initialization failure. Check only the
revision of the current PHY.

Fixes: 7cab44f1c3 ("ice: Introduce ETH56G PHY model for E825C products")
Reviewed-by: Arkadiusz Kubalewski <arkadiusz.kubalewski@intel.com>
Signed-off-by: Karol Kolacinski <karol.kolacinski@intel.com>
Signed-off-by: Grzegorz Nitka <grzegorz.nitka@intel.com>
Tested-by: Pucha Himasekhar Reddy <himasekharx.reddy.pucha@intel.com> (A Contingent worker at Intel)
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
2025-01-13 09:59:14 -08:00
Jakub Kicinski 14ea4cd1b1 Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Cross-merge networking fixes after downstream PR (net-6.13-rc7).

Conflicts:
  a42d71e322 ("net_sched: sch_cake: Add drop reasons")
  737d4d91d3 ("sched: sch_cake: add bounds checks to host bulk flow fairness counts")

Adjacent changes:

drivers/net/ethernet/meta/fbnic/fbnic.h
  3a856ab347 ("eth: fbnic: add IRQ reuse support")
  95978931d5 ("eth: fbnic: Revert "eth: fbnic: Add hardware monitoring support via HWMON interface"")

Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-01-09 16:11:47 -08:00
Frederic Weisbecker b04e317b52 treewide: Introduce kthread_run_worker[_on_cpu]()
kthread_create() creates a kthread without running it yet. kthread_run()
creates a kthread and runs it.

On the other hand, kthread_create_worker() creates a kthread worker and
runs it.

This difference in behaviours is confusing. Also there is no way to
create a kthread worker and affine it using kthread_bind_mask() or
kthread_affine_preferred() before starting it.

Consolidate the behaviours and introduce kthread_run_worker[_on_cpu]()
that behaves just like kthread_run(). kthread_create_worker[_on_cpu]()
will now only create a kthread worker without starting it.

Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Signed-off-by: Dan Carpenter <dan.carpenter@linaro.org>
2025-01-08 18:15:03 +01:00
Dr. David Alan Gilbert 605237372a intel/fm10k: Remove unused fm10k_iov_msg_mac_vlan_pf
fm10k_iov_msg_mac_vlan_pf() has been unused since 2017's
commit 1f5c27e528 ("fm10k: use the MAC/VLAN queue for VF<->PF MAC/VLAN
requests")

Remove it.

Signed-off-by: Dr. David Alan Gilbert <linux@treblig.org>
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
Link: https://patch.msgid.link/20250106221929.956999-16-anthony.l.nguyen@intel.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-01-07 18:16:02 -08:00
Joe Damato b65969856d igc: Link queues to NAPI instances
Link queues to NAPI instances via netdev-genl API so that users can
query this information with netlink. Handle a few cases in the driver:
  1. Link/unlink the NAPIs when XDP is enabled/disabled
  2. Handle IGC_FLAG_QUEUE_PAIRS enabled and disabled

Example output when IGC_FLAG_QUEUE_PAIRS is enabled:

$ ./tools/net/ynl/cli.py --spec Documentation/netlink/specs/netdev.yaml \
                         --dump queue-get --json='{"ifindex": 2}'

[{'id': 0, 'ifindex': 2, 'napi-id': 8193, 'type': 'rx'},
 {'id': 1, 'ifindex': 2, 'napi-id': 8194, 'type': 'rx'},
 {'id': 2, 'ifindex': 2, 'napi-id': 8195, 'type': 'rx'},
 {'id': 3, 'ifindex': 2, 'napi-id': 8196, 'type': 'rx'},
 {'id': 0, 'ifindex': 2, 'napi-id': 8193, 'type': 'tx'},
 {'id': 1, 'ifindex': 2, 'napi-id': 8194, 'type': 'tx'},
 {'id': 2, 'ifindex': 2, 'napi-id': 8195, 'type': 'tx'},
 {'id': 3, 'ifindex': 2, 'napi-id': 8196, 'type': 'tx'}]

Since IGC_FLAG_QUEUE_PAIRS is enabled, you'll note that the same NAPI ID
is present for both rx and tx queues at the same index, for example
index 0:

{'id': 0, 'ifindex': 2, 'napi-id': 8193, 'type': 'rx'},
{'id': 0, 'ifindex': 2, 'napi-id': 8193, 'type': 'tx'},

To test IGC_FLAG_QUEUE_PAIRS disabled, a test system was booted using
the grub command line option "maxcpus=2" to force
igc_set_interrupt_capability to disable IGC_FLAG_QUEUE_PAIRS.

Example output when IGC_FLAG_QUEUE_PAIRS is disabled:

$ lscpu | grep "On-line CPU"
On-line CPU(s) list:      0,2

$ ethtool -l enp86s0  | tail -5
Current hardware settings:
RX:		n/a
TX:		n/a
Other:		1
Combined:	2

$ cat /proc/interrupts  | grep enp
 144: [...] enp86s0
 145: [...] enp86s0-rx-0
 146: [...] enp86s0-rx-1
 147: [...] enp86s0-tx-0
 148: [...] enp86s0-tx-1

1 "other" IRQ, and 2 IRQs for each of RX and Tx, so we expect netlink to
report 4 IRQs with unique NAPI IDs:

$ ./tools/net/ynl/cli.py --spec Documentation/netlink/specs/netdev.yaml \
                         --dump napi-get --json='{"ifindex": 2}'
[{'id': 8196, 'ifindex': 2, 'irq': 148},
 {'id': 8195, 'ifindex': 2, 'irq': 147},
 {'id': 8194, 'ifindex': 2, 'irq': 146},
 {'id': 8193, 'ifindex': 2, 'irq': 145}]

Now we examine which queues these NAPIs are associated with, expecting
that since IGC_FLAG_QUEUE_PAIRS is disabled each RX and TX queue will
have its own NAPI instance:

$ ./tools/net/ynl/cli.py --spec Documentation/netlink/specs/netdev.yaml \
                         --dump queue-get --json='{"ifindex": 2}'
[{'id': 0, 'ifindex': 2, 'napi-id': 8193, 'type': 'rx'},
 {'id': 1, 'ifindex': 2, 'napi-id': 8194, 'type': 'rx'},
 {'id': 0, 'ifindex': 2, 'napi-id': 8195, 'type': 'tx'},
 {'id': 1, 'ifindex': 2, 'napi-id': 8196, 'type': 'tx'}]

Signed-off-by: Joe Damato <jdamato@fastly.com>
Reviewed-by: Vitaly Lifshits <vitaly.lifshits@intel.com>
Tested-by: Avigail Dahan <avigailx.dahan@intel.com>
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
Link: https://patch.msgid.link/20250106221929.956999-15-anthony.l.nguyen@intel.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-01-07 18:16:02 -08:00
Joe Damato 1a63399c13 igc: Link IRQs to NAPI instances
Link IRQs to NAPI instances via netdev-genl API so that users can query
this information with netlink.

Compare the output of /proc/interrupts (noting that IRQ 128 is the
"other" IRQ which does not appear to have a NAPI instance):

$ cat /proc/interrupts | grep enp86s0 | cut --delimiter=":" -f1
 128
 129
 130
 131
 132

The output from netlink shows the mapping of NAPI IDs to IRQs (again
noting that 128 is absent as it is the "other" IRQ):

$ ./tools/net/ynl/cli.py --spec Documentation/netlink/specs/netdev.yaml \
                         --dump napi-get --json='{"ifindex": 2}'

[{'defer-hard-irqs': 0,
  'gro-flush-timeout': 0,
  'id': 8196,
  'ifindex': 2,
  'irq': 132},
 {'defer-hard-irqs': 0,
  'gro-flush-timeout': 0,
  'id': 8195,
  'ifindex': 2,
  'irq': 131},
 {'defer-hard-irqs': 0,
  'gro-flush-timeout': 0,
  'id': 8194,
  'ifindex': 2,
  'irq': 130},
 {'defer-hard-irqs': 0,
  'gro-flush-timeout': 0,
  'id': 8193,
  'ifindex': 2,
  'irq': 129}]

Signed-off-by: Joe Damato <jdamato@fastly.com>
Reviewed-by: Vitaly Lifshits <vitaly.lifshits@intel.com>
Tested-by: Avigail Dahan <avigailx.dahan@intel.com>
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
Link: https://patch.msgid.link/20250106221929.956999-14-anthony.l.nguyen@intel.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-01-07 18:16:02 -08:00
Aleksandr Loktionov 07af482e64 i40e: add ability to reset VF for Tx and Rx MDD events
Implement "mdd-auto-reset-vf" priv-flag to handle Tx and Rx MDD events for VFs.
This flag is also used in other network adapters like ICE.

Usage:
- "on"  - The problematic VF will be automatically reset
	  if a malformed descriptor is detected.
- "off" - The problematic VF will be disabled.

In cases where a VF sends malformed packets classified as malicious, it can
cause the Tx queue to freeze, rendering it unusable for several minutes. When
an MDD event occurs, this new implementation allows for a graceful VF reset to
quickly restore operational state.

Currently, VF queues are disabled if an MDD event occurs. This patch adds the
ability to reset the VF if a Tx or Rx MDD event occurs. It also includes MDD
event logging throttling to avoid dmesg pollution and unifies the format of
Tx and Rx MDD messages.

Note: Standard message rate limiting functions like dev_info_ratelimited()
do not meet our requirements. Custom rate limiting is implemented,
please see the code for details.

Co-developed-by: Jan Sokolowski <jan.sokolowski@intel.com>
Signed-off-by: Jan Sokolowski <jan.sokolowski@intel.com>
Co-developed-by: Padraig J Connolly <padraig.j.connolly@intel.com>
Signed-off-by: Padraig J Connolly <padraig.j.connolly@intel.com>
Signed-off-by: Aleksandr Loktionov <aleksandr.loktionov@intel.com>
Reviewed-by: Michal Schmidt <mschmidt@redhat.com>
Tested-by: Rafal Romanowski <rafal.romanowski@intel.com>
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
Link: https://patch.msgid.link/20250106221929.956999-13-anthony.l.nguyen@intel.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-01-07 18:16:02 -08:00
Yue Haibing 35f715cb77 ixgbevf: Fix passing 0 to ERR_PTR in ixgbevf_run_xdp()
ixgbevf_run_xdp() converts customed xdp action to a negative error code
with the sk_buff pointer type which be checked with IS_ERR in
ixgbevf_clean_rx_irq(). Remove this error pointer handing instead use
plain int return value.

Reviewed-by: Jacob Keller <jacob.e.keller@intel.com>
Reviewed-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com>
Signed-off-by: Yue Haibing <yuehaibing@huawei.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Tested-by: Chandan Kumar Rout <chandanx.rout@intel.com> (A Contingent Worker at Intel)
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
Link: https://patch.msgid.link/20250106221929.956999-12-anthony.l.nguyen@intel.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-01-07 18:15:59 -08:00
Yue Haibing c824125cbb ixgbe: Fix passing 0 to ERR_PTR in ixgbe_run_xdp()
ixgbe_run_xdp() converts customed xdp action to a negative error code
with the sk_buff pointer type which be checked with IS_ERR in
ixgbe_clean_rx_irq(). Remove this error pointer handing instead use
plain int return value.

Reviewed-by: Jacob Keller <jacob.e.keller@intel.com>
Reviewed-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com>
Signed-off-by: Yue Haibing <yuehaibing@huawei.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Tested-by: Chandan Kumar Rout <chandanx.rout@intel.com> (A Contingent Worker at Intel)
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
Link: https://patch.msgid.link/20250106221929.956999-11-anthony.l.nguyen@intel.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-01-07 18:15:57 -08:00
Yue Haibing 8ae94669b1 igb: Fix passing 0 to ERR_PTR in igb_run_xdp()
igb_run_xdp() converts customed xdp action to a negative error code
with the sk_buff pointer type which be checked with IS_ERR in
igb_clean_rx_irq(). Remove this error pointer handing instead use plain
int return value.

Reviewed-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com>
Reviewed-by: Jacob Keller <jacob.e.keller@intel.com>
Signed-off-by: Yue Haibing <yuehaibing@huawei.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Tested-by: Chandan Kumar Rout <chandanx.rout@intel.com> (A Contingent Worker at Intel)
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
Link: https://patch.msgid.link/20250106221929.956999-10-anthony.l.nguyen@intel.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-01-07 18:15:54 -08:00
Yue Haibing 8b6237e1f4 igc: Fix passing 0 to ERR_PTR in igc_xdp_run_prog()
igc_xdp_run_prog() converts customed xdp action to a negative error code
with the sk_buff pointer type which be checked with IS_ERR in
igc_clean_rx_irq(). Remove this error pointer handing instead use plain
int return value to fix this smatch warnings:

drivers/net/ethernet/intel/igc/igc_main.c:2533
 igc_xdp_run_prog() warn: passing zero to 'ERR_PTR'

Reviewed-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com>
Reviewed-by: Jacob Keller <jacob.e.keller@intel.com>
Signed-off-by: Yue Haibing <yuehaibing@huawei.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Tested-by: Avigail Dahan <avigailx.dahan@intel.com>
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
Link: https://patch.msgid.link/20250106221929.956999-9-anthony.l.nguyen@intel.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-01-07 18:15:51 -08:00
Song Yoong Siang 484d3675f2 igc: Allow hot-swapping XDP program
Currently, the driver would always close and reopen the network interface
when setting/removing the XDP program, regardless of the presence of XDP
resources. This could cause unnecessary disruptions.

To avoid this, introduces a check to determine if there is a need to
close and reopen the interface, allowing for seamless hot-swapping of
XDP programs.

Signed-off-by: Song Yoong Siang <yoong.siang.song@intel.com>
Acked-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com>
Tested-by: Avigail Dahan <avigailx.dahan@intel.com>
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
Link: https://patch.msgid.link/20250106221929.956999-8-anthony.l.nguyen@intel.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-01-07 18:14:09 -08:00
Sriram Yagnaraman f8e284a02a igb: Add AF_XDP zero-copy Tx support
Add support for AF_XDP zero-copy transmit path.

A new TX buffer type IGB_TYPE_XSK is introduced to indicate that the Tx
frame was allocated from the xsk buff pool, so igb_clean_tx_ring() and
igb_clean_tx_irq() can clean the buffers correctly based on type.

igb_xmit_zc() performs the actual packet transmit when AF_XDP zero-copy is
enabled. We share the TX ring between slow path, XDP and AF_XDP
zero-copy, so we use the netdev queue lock to ensure mutual exclusion.

Signed-off-by: Sriram Yagnaraman <sriram.yagnaraman@est.tech>
[Kurt: Set olinfo_status in igb_xmit_zc() so that frames are transmitted,
       Use READ_ONCE() for xsk_pool and check Tx disabled and carrier in
       igb_xmit_zc(), Add FIXME for RS bit]
Signed-off-by: Kurt Kanzenbach <kurt@linutronix.de>
Reviewed-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com>
Tested-by: George Kuruvinakunnel <george.kuruvinakunnel@intel.com>
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
Link: https://patch.msgid.link/20250106221929.956999-7-anthony.l.nguyen@intel.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-01-07 18:14:09 -08:00
Sriram Yagnaraman 2c6196013f igb: Add AF_XDP zero-copy Rx support
Add support for AF_XDP zero-copy receive path.

When AF_XDP zero-copy is enabled, the rx buffers are allocated from the
xsk buff pool using igb_alloc_rx_buffers_zc().

Use xsk_pool_get_rx_frame_size() to set SRRCTL rx buf size when zero-copy
is enabled.

Signed-off-by: Sriram Yagnaraman <sriram.yagnaraman@est.tech>
[Kurt: Port to v6.12 and provide napi_id for xdp_rxq_info_reg(),
       RCT, remove NETDEV_XDP_ACT_XSK_ZEROCOPY, update NTC handling,
       READ_ONCE() xsk_pool, likelyfy for XDP_REDIRECT case]
Signed-off-by: Kurt Kanzenbach <kurt@linutronix.de>
Acked-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com>
Tested-by: George Kuruvinakunnel <george.kuruvinakunnel@intel.com>
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
Link: https://patch.msgid.link/20250106221929.956999-6-anthony.l.nguyen@intel.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-01-07 18:14:08 -08:00
Kurt Kanzenbach 0fe7cce600 igb: Add XDP finalize and stats update functions
Move XDP finalize and Rx statistics update into separate functions. This
way, they can be reused by the XDP and XDP/ZC code later.

Signed-off-by: Kurt Kanzenbach <kurt@linutronix.de>
Acked-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com>
Tested-by: George Kuruvinakunnel <george.kuruvinakunnel@intel.com>
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
Link: https://patch.msgid.link/20250106221929.956999-5-anthony.l.nguyen@intel.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-01-07 18:14:08 -08:00
Sriram Yagnaraman 80f6ccf9f1 igb: Introduce XSK data structures and helpers
Add the following ring flag:
- IGB_RING_FLAG_TX_DISABLED (when xsk pool is being setup)

Add a xdp_buff array for use with XSK receive batch API, and a pointer
to xsk_pool in igb_adapter.

Add enable/disable functions for TX and RX rings.
Add enable/disable functions for XSK pool.
Add xsk wakeup function.

None of the above functionality will be active until
NETDEV_XDP_ACT_XSK_ZEROCOPY is advertised in netdev->xdp_features.

Signed-off-by: Sriram Yagnaraman <sriram.yagnaraman@est.tech>
[Kurt: Add READ/WRITE_ONCE(), synchronize_net(),
       remove IGB_RING_FLAG_AF_XDP_ZC]
Signed-off-by: Kurt Kanzenbach <kurt@linutronix.de>
Reviewed-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com>
Tested-by: George Kuruvinakunnel <george.kuruvinakunnel@intel.com>
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
Link: https://patch.msgid.link/20250106221929.956999-4-anthony.l.nguyen@intel.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-01-07 18:14:08 -08:00
Sriram Yagnaraman 6dc75fc230 igb: Introduce igb_xdp_is_enabled()
Introduce igb_xdp_is_enabled() to check if an XDP program is assigned to
the device. Use that wherever xdp_prog is read and evaluated.

Signed-off-by: Sriram Yagnaraman <sriram.yagnaraman@est.tech>
[Kurt: Split patches and use READ_ONCE()]
Signed-off-by: Kurt Kanzenbach <kurt@linutronix.de>
Acked-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com>
Tested-by: George Kuruvinakunnel <george.kuruvinakunnel@intel.com>
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
Link: https://patch.msgid.link/20250106221929.956999-3-anthony.l.nguyen@intel.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-01-07 18:14:08 -08:00
Sriram Yagnaraman f70b864ccc igb: Remove static qualifiers
Remove static qualifiers on the following functions to be able to call
from XSK specific file that is added in the later patches:
- igb_xdp_tx_queue_mapping()
- igb_xdp_ring_update_tail()
- igb_clean_tx_ring()
- igb_clean_rx_ring()
- igb_xdp_xmit_back()
- igb_process_skb_fields()

While at it, inline igb_xdp_tx_queue_mapping() and
igb_xdp_ring_update_tail(). These functions are small enough and used in
XDP hot paths.

Signed-off-by: Sriram Yagnaraman <sriram.yagnaraman@est.tech>
[Kurt: Split patches, inline small XDP functions]
Signed-off-by: Kurt Kanzenbach <kurt@linutronix.de>
Acked-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com>
Tested-by: George Kuruvinakunnel <george.kuruvinakunnel@intel.com>
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
Link: https://patch.msgid.link/20250106221929.956999-2-anthony.l.nguyen@intel.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-01-07 18:14:08 -08:00
Dr. David Alan Gilbert 4ce1aeece9 ixgbevf: Remove unused ixgbevf_hv_mbx_ops
The const struct ixgbevf_hv_mbx_ops was added in 2016 as part of
commit c6d45171d7 ("ixgbevf: Support Windows hosts (Hyper-V)")

but has remained unused.

The functions it references are still referenced elsewhere.

Remove it.

Signed-off-by: Dr. David Alan Gilbert <linux@treblig.org>
Link: https://patch.msgid.link/20250105122847.27341-1-linux@treblig.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-01-07 17:43:47 -08:00
En-Wei Wu bd2776e39c igc: return early when failing to read EECD register
When booting with a dock connected, the igc driver may get stuck for ~40
seconds if PCIe link is lost during initialization.

This happens because the driver access device after EECD register reads
return all F's, indicating failed reads. Consequently, hw->hw_addr is set
to NULL, which impacts subsequent rd32() reads. This leads to the driver
hanging in igc_get_hw_semaphore_i225(), as the invalid hw->hw_addr
prevents retrieving the expected value.

To address this, a validation check and a corresponding return value
catch is added for the EECD register read result. If all F's are
returned, indicating PCIe link loss, the driver will return -ENXIO
immediately. This avoids the 40-second hang and significantly improves
boot time when using a dock with an igc NIC.

Log before the patch:
[    0.911913] igc 0000:70:00.0: enabling device (0000 -> 0002)
[    0.912386] igc 0000:70:00.0: PTM enabled, 4ns granularity
[    1.571098] igc 0000:70:00.0 (unnamed net_device) (uninitialized): PCIe link lost, device now detached
[   43.449095] igc_get_hw_semaphore_i225: igc 0000:70:00.0 (unnamed net_device) (uninitialized): Driver can't access device - SMBI bit is set.
[   43.449186] igc 0000:70:00.0: probe with driver igc failed with error -13
[   46.345701] igc 0000:70:00.0: enabling device (0000 -> 0002)
[   46.345777] igc 0000:70:00.0: PTM enabled, 4ns granularity

Log after the patch:
[    1.031000] igc 0000:70:00.0: enabling device (0000 -> 0002)
[    1.032097] igc 0000:70:00.0: PTM enabled, 4ns granularity
[    1.642291] igc 0000:70:00.0 (unnamed net_device) (uninitialized): PCIe link lost, device now detached
[    5.480490] igc 0000:70:00.0: enabling device (0000 -> 0002)
[    5.480516] igc 0000:70:00.0: PTM enabled, 4ns granularity

Fixes: ab40561268 ("igc: Add NVM support")
Cc: Chia-Lin Kao (AceLan) <acelan.kao@canonical.com>
Signed-off-by: En-Wei Wu <en-wei.wu@canonical.com>
Reviewed-by: Vitaly Lifshits <vitaly.lifshits@intel.com>
Tested-by: Mor Bar-Gabay <morx.bar.gabay@intel.com>
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
2025-01-07 09:01:15 -08:00
Przemyslaw Korba 6c5b989116 ice: fix incorrect PHY settings for 100 GB/s
ptp4l application reports too high offset when ran on E823 device
with a 100GB/s link. Those values cannot go under 100ns, like in a
working case when using 100 GB/s cable.

This is due to incorrect frequency settings on the PHY clocks for
100 GB/s speed. Changes are introduced to align with the internal
hardware documentation, and correctly initialize frequency in PHY
clocks with the frequency values that are in our HW spec.

To reproduce the issue run ptp4l as a Time Receiver on E823 device,
and observe the offset, which will never approach values seen
in the PTP working case.

Reproduction output:
ptp4l -i enp137s0f3 -m -2 -s -f /etc/ptp4l_8275.conf
ptp4l[5278.775]: master offset      12470 s2 freq  +41288 path delay -3002
ptp4l[5278.837]: master offset      10525 s2 freq  +39202 path delay -3002
ptp4l[5278.900]: master offset     -24840 s2 freq  -20130 path delay -3002
ptp4l[5278.963]: master offset      10597 s2 freq  +37908 path delay -3002
ptp4l[5279.025]: master offset       8883 s2 freq  +36031 path delay -3002
ptp4l[5279.088]: master offset       7267 s2 freq  +34151 path delay -3002
ptp4l[5279.150]: master offset       5771 s2 freq  +32316 path delay -3002
ptp4l[5279.213]: master offset       4388 s2 freq  +30526 path delay -3002
ptp4l[5279.275]: master offset     -30434 s2 freq  -28485 path delay -3002
ptp4l[5279.338]: master offset     -28041 s2 freq  -27412 path delay -3002
ptp4l[5279.400]: master offset       7870 s2 freq  +31118 path delay -3002

Fixes: 3a7496234d ("ice: implement basic E822 PTP support")
Reviewed-by: Milena Olech <milena.olech@intel.com>
Signed-off-by: Przemyslaw Korba <przemyslaw.korba@intel.com>
Tested-by: Rinitha S <sx.rinitha@intel.com> (A Contingent worker at Intel)
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
2025-01-07 09:01:15 -08:00
Arkadiusz Kubalewski 65104599b3 ice: fix max values for dpll pin phase adjust
Mask admin command returned max phase adjust value for both input and
output pins. Only 31 bits are relevant, last released data sheet wrongly
points that 32 bits are valid - see [1] 3.2.6.4.1 Get CCU Capabilities
Command for reference. Fix of the datasheet itself is in progress.

Fix the min/max assignment logic, previously the value was wrongly
considered as negative value due to most significant bit being set.

Example of previous broken behavior:
$ ./tools/net/ynl/cli.py --spec Documentation/netlink/specs/dpll.yaml \
--do pin-get --json '{"id":1}'| grep phase-adjust
 'phase-adjust': 0,
 'phase-adjust-max': 16723,
 'phase-adjust-min': -16723,

Correct behavior with the fix:
$ ./tools/net/ynl/cli.py --spec Documentation/netlink/specs/dpll.yaml \
--do pin-get --json '{"id":1}'| grep phase-adjust
 'phase-adjust': 0,
 'phase-adjust-max': 2147466925,
 'phase-adjust-min': -2147466925,

[1] https://cdrdv2.intel.com/v1/dl/getContent/613875?explicitVersion=true

Fixes: 90e1c90750 ("ice: dpll: implement phase related callbacks")
Reviewed-by: Przemek Kitszel <przemyslaw.kitszel@intel.com>
Signed-off-by: Arkadiusz Kubalewski <arkadiusz.kubalewski@intel.com>
Tested-by: Pucha Himasekhar Reddy <himasekharx.reddy.pucha@intel.com> (A Contingent worker at Intel)
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
2025-01-07 09:01:15 -08:00
Dr. David Alan Gilbert c758890813 igc: Remove unused igc_read/write_pcie_cap_reg
The last uses of igc_read_pcie_cap_reg() and igc_write_pcie_cap_reg()
were removed in 2019 by
commit 16ecd8d9af ("igc: Remove the obsolete workaround")

Remove them.

Signed-off-by: Dr. David Alan Gilbert <linux@treblig.org>
Reviewed-by: Tony Nguyen <anthony.l.nguyen@intel.com>
Link: https://patch.msgid.link/20250102174142.200700-4-linux@treblig.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-01-06 13:32:44 -08:00
Dr. David Alan Gilbert 121c3c6bc6 igc: Remove unused igc_read/write_pci_cfg wrappers
igc_read_pci_cfg() and igc_write_pci_cfg were added in 2018 as part of
commit 146740f9ab ("igc: Add support for PF")
but have remained unused.

Remove them.

Signed-off-by: Dr. David Alan Gilbert <linux@treblig.org>
Reviewed-by: Tony Nguyen <anthony.l.nguyen@intel.com>
Link: https://patch.msgid.link/20250102174142.200700-3-linux@treblig.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-01-06 13:32:44 -08:00
Dr. David Alan Gilbert b37dba891b igc: Remove unused igc_acquire/release_nvm
igc_acquire_nvm() and igc_release_nvm() were added in 2018 as part of
commit ab40561268 ("igc: Add NVM support")

but never used.

Remove them.

The igc_1225.c has it's own specific implementations.

Signed-off-by: Dr. David Alan Gilbert <linux@treblig.org>
Reviewed-by: Tony Nguyen <anthony.l.nguyen@intel.com>
Link: https://patch.msgid.link/20250102174142.200700-2-linux@treblig.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-01-06 13:32:44 -08:00
Dr. David Alan Gilbert 47ea5d4e6f i40e: Remove unused i40e_dcb_hw_get_num_tc
The last useof i40e_dcb_hw_get_num_tc() was removed in 2022 by
commit fe20371578 ("Revert "i40e: Fix reset bw limit when DCB enabled
with 1 TC"")

Remove it.

Signed-off-by: Dr. David Alan Gilbert <linux@treblig.org>
Reviewed-by: Kalesh AP <kalesh-anakkur.purayil@broadcom.com>
Reviewed-by: Tony Nguyen <anthony.l.nguyen@intel.com>
Link: https://patch.msgid.link/20250102173717.200359-10-linux@treblig.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-01-06 13:31:49 -08:00
Dr. David Alan Gilbert d424b93f35 i40e: Remove unused i40e_asq_send_command_v2
i40e_asq_send_command_v2() was added in 2022 by
commit 74073848b0 ("i40e: Add new versions of send ASQ command
functions")
but hasn't been used.

Remove it.

(The _atomic_v2 version of the function is used, so leave it).

Signed-off-by: Dr. David Alan Gilbert <linux@treblig.org>
Reviewed-by: Tony Nguyen <anthony.l.nguyen@intel.com>
Link: https://patch.msgid.link/20250102173717.200359-9-linux@treblig.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-01-06 13:31:48 -08:00
Dr. David Alan Gilbert a324484ac8 i40e: Remove unused i40e_commit_partition_bw_setting
i40e_commit_partition_bw_setting() was added in 2017 by
commit 4fc8c67639 ("i40e: genericize the partition bandwidth control")
but hasn't been used.

Remove it.

Signed-off-by: Dr. David Alan Gilbert <linux@treblig.org>
Reviewed-by: Kalesh AP <kalesh-anakkur.purayil@broadcom.com>
Reviewed-by: Tony Nguyen <anthony.l.nguyen@intel.com>
Link: https://patch.msgid.link/20250102173717.200359-8-linux@treblig.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-01-06 13:31:48 -08:00
Dr. David Alan Gilbert 38dfb07d9a i40e: Remove unused i40e_del_filter
The last use of i40e_del_filter() was removed in 2016 by
commit 9569a9a454 ("i40e: when adding or removing MAC filters, correctly
handle VLANs")

Remove it.

Fix up a comment that referenced it.

Note: The __ version of this function is still used.

Signed-off-by: Dr. David Alan Gilbert <linux@treblig.org>
Reviewed-by: Tony Nguyen <anthony.l.nguyen@intel.com>
Link: https://patch.msgid.link/20250102173717.200359-7-linux@treblig.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-01-06 13:31:48 -08:00
Dr. David Alan Gilbert 3eb24a9e0a i40e: Remove unused i40e_get_cur_guaranteed_fd_count
The last use of i40e_get_cur_guaranteed_fd_count() was removed in 2015 by
commit 04294e38a4 ("i40e: FD filters flush policy changes")

Remove it.

Signed-off-by: Dr. David Alan Gilbert <linux@treblig.org>
Reviewed-by: Kalesh AP <kalesh-anakkur.purayil@broadcom.com>
Reviewed-by: Tony Nguyen <anthony.l.nguyen@intel.com>
Link: https://patch.msgid.link/20250102173717.200359-6-linux@treblig.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-01-06 13:31:48 -08:00
Dr. David Alan Gilbert 81d6bb2012 i40e: Deadcode profile code
i40e_add_pinfo_to_list() was added in 2017 by
commit 1d5c960c5e ("i40e: new AQ commands")

i40e_find_section_in_profile() was added in 2019 by
commit cdc594e003 ("i40e: Implement DDP support in i40e driver")

Neither have been used.

Remove them.

Signed-off-by: Dr. David Alan Gilbert <linux@treblig.org>
Reviewed-by: Tony Nguyen <anthony.l.nguyen@intel.com>
Link: https://patch.msgid.link/20250102173717.200359-5-linux@treblig.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-01-06 13:31:48 -08:00
Dr. David Alan Gilbert 8cc51e28ec i40e: Remove unused i40e_(read|write)_phy_register
i40e_read_phy_register() and i40e_write_phy_register() were added in
2016 by
commit f62ba91458 ("i40e: Add functions which apply correct PHY access
method for read and write operation")

but haven't been used.

Remove them.

(There are more specific _clause* variants of these functions
that are still used.)

Signed-off-by: Dr. David Alan Gilbert <linux@treblig.org>
Reviewed-by: Tony Nguyen <anthony.l.nguyen@intel.com>
Link: https://patch.msgid.link/20250102173717.200359-4-linux@treblig.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-01-06 13:31:48 -08:00
Dr. David Alan Gilbert 39cabb01d2 i40e: Remove unused i40e_blink_phy_link_led
i40e_blink_phy_link_led() was added in 2016 by
commit fd077cd339 ("i40e: Add functions to blink led on 10GBaseT PHY")

but hasn't been used.

Remove it.

Signed-off-by: Dr. David Alan Gilbert <linux@treblig.org>
Reviewed-by: Kalesh AP <kalesh-anakkur.purayil@broadcom.com>
Reviewed-by: Tony Nguyen <anthony.l.nguyen@intel.com>
Link: https://patch.msgid.link/20250102173717.200359-3-linux@treblig.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-01-06 13:31:47 -08:00
Dr. David Alan Gilbert 59ec698d01 i40e: Deadcode i40e_aq_*
i40e_aq_add_mirrorrule(), i40e_aq_delete_mirrorrule() and
i40e_aq_set_vsi_vlan_promisc() were added in 2016 by
commit 7bd6875bef ("i40e: APIs to Add/remove port mirroring rules")
but haven't been used.

They were the last user of i40e_mirrorrule_op().

i40e_aq_rearrange_nvm() was added in 2018 by
commit f05798b4ff ("i40e: Add AQ command for rearrange NVM structure")
but hasn't been used.

i40e_aq_restore_lldp() was added in 2019 by
commit c65e78f87f ("i40e: Further implementation of LLDP")
but hasn't been used.

Remove them.

Signed-off-by: Dr. David Alan Gilbert <linux@treblig.org>
Reviewed-by: Tony Nguyen <anthony.l.nguyen@intel.com>
Link: https://patch.msgid.link/20250102173717.200359-2-linux@treblig.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-01-06 13:31:47 -08:00
Piotr Kwapulinski 4c44b450c6 ixgbevf: Add support for Intel(R) E610 device
Add support for Intel(R) E610 Series of network devices. The E610
is based on X550 but adds firmware managed link, enhanced security
capabilities and support for updated server manageability

Reviewed-by: Przemek Kitszel <przemyslaw.kitszel@intel.com>
Signed-off-by: Piotr Kwapulinski <piotr.kwapulinski@intel.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Tested-by: Rafal Romanowski <rafal.romanowski@intel.com>
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
2024-12-20 10:14:05 -08:00
Piotr Kwapulinski 4600cdf9f5 ixgbe: Enable link management in E610 device
Add high level link management support for E610 device. Enable the
following features:
- driver load
- bring up network interface
- IP address assignment
- pass traffic
- show statistics (e.g. via ethtool)
- disable network interface
- driver unload

Co-developed-by: Carolyn Wyborny <carolyn.wyborny@intel.com>
Signed-off-by: Carolyn Wyborny <carolyn.wyborny@intel.com>
Co-developed-by: Jedrzej Jagielski <jedrzej.jagielski@intel.com>
Signed-off-by: Jedrzej Jagielski <jedrzej.jagielski@intel.com>
Reviewed-by: Jan Glaza <jan.glaza@intel.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Tested-by: Bharath R <bharath.r@intel.com>
Signed-off-by: Piotr Kwapulinski <piotr.kwapulinski@intel.com>
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
2024-12-20 10:14:03 -08:00
Piotr Kwapulinski 34b4157707 ixgbe: Clean up the E610 link management related code
Required for enabling the link management in E610 device.

Reviewed-by: Simon Horman <horms@kernel.org>
Tested-by: Bharath R <bharath.r@intel.com>
Signed-off-by: Piotr Kwapulinski <piotr.kwapulinski@intel.com>
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
2024-12-20 10:14:03 -08:00
Piotr Kwapulinski a0834bd521 ixgbe: Add ixgbe_x540 multiple header inclusion protection
Required to adopt x540 specific functions by E610 device.

Reviewed-by: Simon Horman <horms@kernel.org>
Tested-by: Bharath R <bharath.r@intel.com>
Signed-off-by: Piotr Kwapulinski <piotr.kwapulinski@intel.com>
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
2024-12-20 10:14:02 -08:00
Piotr Kwapulinski e5b132b4f4 ixgbe: Add support for EEPROM dump in E610 device
Add low level support for EEPROM dump for the specified network device.

Co-developed-by: Stefan Wegrzyn <stefan.wegrzyn@intel.com>
Signed-off-by: Stefan Wegrzyn <stefan.wegrzyn@intel.com>
Reviewed-by: Przemek Kitszel <przemyslaw.kitszel@intel.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Tested-by: Bharath R <bharath.r@intel.com>
Signed-off-by: Piotr Kwapulinski <piotr.kwapulinski@intel.com>
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
2024-12-20 10:14:01 -08:00
Piotr Kwapulinski d2483ebc9d ixgbe: Add support for NVM handling in E610 device
Add low level support for accessing NVM in E610 device. NVM operations are
handled via the Admin Command Interface. Add the following NVM specific
operations:
- acquire, release, read
- validate checksum
- read shadow ram

Co-developed-by: Stefan Wegrzyn <stefan.wegrzyn@intel.com>
Signed-off-by: Stefan Wegrzyn <stefan.wegrzyn@intel.com>
Co-developed-by: Jedrzej Jagielski <jedrzej.jagielski@intel.com>
Signed-off-by: Jedrzej Jagielski <jedrzej.jagielski@intel.com>
Reviewed-by: Michal Swiatkowski <michal.swiatkowski@linux.intel.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Tested-by: Bharath R <bharath.r@intel.com>
Signed-off-by: Piotr Kwapulinski <piotr.kwapulinski@intel.com>
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
2024-12-20 10:14:00 -08:00
Piotr Kwapulinski 23c0e5a16b ixgbe: Add link management support for E610 device
Add low level link management support for E610 device. Link management
operations are handled via the Admin Command Interface. Add the following
link management operations:
- get link capabilities
- set up link
- get media type
- get link status, link status events
- link power management

Co-developed-by: Stefan Wegrzyn <stefan.wegrzyn@intel.com>
Signed-off-by: Stefan Wegrzyn <stefan.wegrzyn@intel.com>
Co-developed-by: Jedrzej Jagielski <jedrzej.jagielski@intel.com>
Signed-off-by: Jedrzej Jagielski <jedrzej.jagielski@intel.com>
Reviewed-by: Jan Glaza <jan.glaza@intel.com>
Tested-by: Bharath R <bharath.r@intel.com>
Signed-off-by: Piotr Kwapulinski <piotr.kwapulinski@intel.com>
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
2024-12-20 10:13:53 -08:00
Piotr Kwapulinski 7c3aa0fccb ixgbe: Add support for E610 device capabilities detection
Add low level support for E610 device capabilities detection. The
capabilities are discovered via the Admin Command Interface. Discover the
following capabilities:
- function caps: vmdq, dcb, rss, rx/tx qs, msix, nvm, orom, reset
- device caps: vsi, fdir, 1588
- phy caps

Co-developed-by: Stefan Wegrzyn <stefan.wegrzyn@intel.com>
Signed-off-by: Stefan Wegrzyn <stefan.wegrzyn@intel.com>
Co-developed-by: Jedrzej Jagielski <jedrzej.jagielski@intel.com>
Signed-off-by: Jedrzej Jagielski <jedrzej.jagielski@intel.com>
Reviewed-by: Jan Sokolowski <jan.sokolowski@intel.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Tested-by: Bharath R <bharath.r@intel.com>
Signed-off-by: Piotr Kwapulinski <piotr.kwapulinski@intel.com>
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
2024-12-20 09:49:03 -08:00
Piotr Kwapulinski 46761fd52a ixgbe: Add support for E610 FW Admin Command Interface
Add low level support for Admin Command Interface (ACI). ACI is the
Firmware interface used by a driver to communicate with E610 adapter. Add
the following ACI features:
- data structures, macros, register definitions
- commands handling
- events handling

Co-developed-by: Stefan Wegrzyn <stefan.wegrzyn@intel.com>
Signed-off-by: Stefan Wegrzyn <stefan.wegrzyn@intel.com>
Co-developed-by: Jedrzej Jagielski <jedrzej.jagielski@intel.com>
Signed-off-by: Jedrzej Jagielski <jedrzej.jagielski@intel.com>
Reviewed-by: Michal Swiatkowski <michal.swiatkowski@linux.intel.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Tested-by: Bharath R <bharath.r@intel.com>
Signed-off-by: Piotr Kwapulinski <piotr.kwapulinski@intel.com>
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
2024-12-20 09:39:52 -08:00
Alexander Lobakin 51205f841a xsk: make xsk_buff_add_frag() really add the frag via __xdp_buff_add_frag()
Currently, xsk_buff_add_frag() only adds the frag to pool's linked list,
not doing anything with the &xdp_buff. The drivers do that manually and
the logic is the same.
Make it really add an skb frag, just like xdp_buff_add_frag() does that,
and freeing frags on error if needed. This allows to remove repeating
code from i40e and ice and not add the same code again and again.

Acked-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com>
Signed-off-by: Alexander Lobakin <aleksander.lobakin@intel.com>
Link: https://patch.msgid.link/20241218174435.1445282-5-aleksander.lobakin@intel.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-12-19 19:51:14 -08:00
Jakub Kicinski 07e5c4eb94 Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Cross-merge networking fixes after downstream PR (net-6.13-rc4).

No conflicts.

Adjacent changes:

drivers/net/ethernet/renesas/rswitch.h
  32fd46f5b6 ("net: renesas: rswitch: remove speed from gwca structure")
  922b4b955a ("net: renesas: rswitch: rework ts tags management")

Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-12-19 11:35:07 -08:00
Joshua Hay 0c1683c681 idpf: trigger SW interrupt when exiting wb_on_itr mode
There is a race condition between exiting wb_on_itr and completion write
backs. For example, we are in wb_on_itr mode and a Tx completion is
generated by HW, ready to be written back, as we are re-enabling
interrupts:

	HW                      SW
	|                       |
	|			| idpf_tx_splitq_clean_all
	|                       | napi_complete_done
	|			|
	| tx_completion_wb 	| idpf_vport_intr_update_itr_ena_irq

That tx_completion_wb happens before the vector is fully re-enabled.
Continuing with this example, it is a UDP stream and the
tx_completion_wb is the last one in the flow (there are no rx packets).
Because the HW generated the completion before the interrupt is fully
enabled, the HW will not fire the interrupt once the timer expires and
the write back will not happen. NAPI poll won't be called.  We have
indicated we're back in interrupt mode but nothing else will trigger the
interrupt. Therefore, the completion goes unprocessed, triggering a Tx
timeout.

To mitigate this, fire a SW triggered interrupt upon exiting wb_on_itr.
This interrupt will catch the rogue completion and avoid the timeout.
Add logic to set the appropriate bits in the vector's dyn_ctl register.

Fixes: 9c4a27da0e ("idpf: enable WB_ON_ITR")
Reviewed-by: Madhu Chittim <madhu.chittim@intel.com>
Signed-off-by: Joshua Hay <joshua.a.hay@intel.com>
Tested-by: Krishneil Singh <krishneil.k.singh@intel.com>
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
2024-12-17 14:00:26 -08:00
Joshua Hay 93433c1d91 idpf: add support for SW triggered interrupts
SW triggered interrupts are guaranteed to fire after their timer
expires, unlike Tx and Rx interrupts which will only fire after the
timer expires _and_ a descriptor write back is available to be processed
by the driver.

Add the necessary fields, defines, and initializations to enable a SW
triggered interrupt in the vector's dyn_ctl register.

Reviewed-by: Madhu Chittim <madhu.chittim@intel.com>
Signed-off-by: Joshua Hay <joshua.a.hay@intel.com>
Tested-by: Krishneil Singh <krishneil.k.singh@intel.com>
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
2024-12-17 13:28:55 -08:00
Ben Shelton bc10274739 ice: Add MDD logging via devlink health
Add a devlink health reporter for MDD events. The 'dump' handler will
return the information captured in each call to ice_handle_mdd_event().
A device reset (CORER/PFR) will put the reporter back in healthy state.

Signed-off-by: Ben Shelton <benjamin.h.shelton@intel.com>
Reviewed-by: Igor Bagnucki <igor.bagnucki@intel.com>
Reviewed-by: Wojciech Drewek <wojciech.drewek@intel.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Signed-off-by: Mateusz Polchlopek <mateusz.polchlopek@intel.com>
Tested-by: Pucha Himasekhar Reddy <himasekharx.reddy.pucha@intel.com> (A Contingent worker at Intel)
Co-developed-by: Przemek Kitszel <przemyslaw.kitszel@intel.com>
Signed-off-by: Przemek Kitszel <przemyslaw.kitszel@intel.com>
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
2024-12-17 11:32:51 -08:00
Przemek Kitszel 2a82874a3b ice: add Tx hang devlink health reporter
Add Tx hang devlink health reporter, see struct ice_tx_hang_event to see
what exactly is reported. For now dump descriptors with little metadata
and skb diagnostic information.

Reviewed-by: Igor Bagnucki <igor.bagnucki@intel.com>
Reviewed-by: Wojciech Drewek <wojciech.drewek@intel.com>
Co-developed-by: Mateusz Polchlopek <mateusz.polchlopek@intel.com>
Signed-off-by: Mateusz Polchlopek <mateusz.polchlopek@intel.com>
Tested-by: Pucha Himasekhar Reddy <himasekharx.reddy.pucha@intel.com> (A Contingent worker at Intel)
Signed-off-by: Przemek Kitszel <przemyslaw.kitszel@intel.com>
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
2024-12-17 11:32:46 -08:00
Przemek Kitszel 2846fe5614 ice: rename devlink_port.[ch] to port.[ch]
Drop "devlink_" prefix from files that sit in devlink/.
I'm going to add more files there, and repeating "devlink" does not feel
good. This is also the scheme used in most other places, most notably the
devlink core files are named like that.

devlink.[ch] stays as is.

Reviewed-by: Kalesh AP <kalesh-anakkur.purayil@broadcom.com>
Signed-off-by: Przemek Kitszel <przemyslaw.kitszel@intel.com>
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
2024-12-17 09:32:08 -08:00
Jacob Keller 39be64c34c ice: cleanup Rx queue context programming functions
The ice_copy_rxq_ctx_to_hw() and ice_write_rxq_ctx() functions perform some
defensive checks which are typically frowned upon by kernel style
guidelines.

In particular, NULL checks on buffers which point to the stack are
discouraged, especially when the functions are static and only called once.
Checks of this sort only serve to hide potential programming error, as we
will not produce the normal crash dump on a NULL access.

In addition, ice_copy_rxq_ctx_to_hw() cannot fail in another way, so could
be made void.

Future support for VF Live Migration will need to introduce an inverse
function for reading Rx queue context from HW registers to unpack it, as
well as functions to pack and unpack Tx queue context from HW.

Rather than copying these style issues into the new functions, lets first
cleanup the existing code.

For the ice_copy_rxq_ctx_to_hw() function:

 * Move the Rx queue index check out of this function.
 * Convert the function to a void return.
 * Use a simple int variable instead of a u8 for the for loop index, and
   initialize it inside the for loop.
 * Update the function description to better align with kernel doc style.

For the ice_write_rxq_ctx() function:

 * Move the Rx queue index check into this function.
 * Update the function description with a Returns: to align with kernel doc
   style.

These changes align the existing write functions to current kernel
style, and will align with the style of the new functions added when we
implement live migration in a future series.

Signed-off-by: Jacob Keller <jacob.e.keller@intel.com>
Reviewed-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Link: https://patch.msgid.link/20241210-packing-pack-fields-and-ice-implementation-v10-10-ee56a47479ac@intel.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-12-11 20:13:01 -08:00
Jacob Keller ac001acc4d ice: move prefetch enable to ice_setup_rx_ctx
The ice_write_rxq_ctx() function is responsible for programming the Rx
Queue context into hardware. It receives the configuration in unpacked form
via the ice_rlan_ctx structure.

This function unconditionally modifies the context to set the prefetch
enable bit. This was done by commit c31a5c25bb ("ice: Always set prefena
when configuring an Rx queue"). Setting this bit makes sense, since
prefetching descriptors is almost always the preferred behavior.

However, the ice_write_rxq_ctx() function is not the place that actually
defines the queue context. We initialize the Rx Queue context in
ice_setup_rx_ctx(). It is surprising to have the Rx queue context changed
by a function who's responsibility is to program the given context to
hardware.

Following the principle of least surprise, move the setting of the prefetch
enable bit out of ice_write_rxq_ctx() and into the ice_setup_rx_ctx().

Signed-off-by: Jacob Keller <jacob.e.keller@intel.com>
Reviewed-by: Przemek Kitszel <przemyslaw.kitszel@intel.com>
Reviewed-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Link: https://patch.msgid.link/20241210-packing-pack-fields-and-ice-implementation-v10-9-ee56a47479ac@intel.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-12-11 20:13:01 -08:00
Jacob Keller f72588a426 ice: reduce size of queue context fields
The ice_rlan_ctx and ice_tlan_ctx structures have some fields which are
intentionally sized larger than necessary relative to the packed sizes the
data must fit into. This was done because the original ice_set_ctx()
function and its helpers did not correctly handle packing when the packed
bits straddled a byte. This is no longer the case with the use of the
<linux/packing.h> implementation.

Save some bytes in these structures by sizing the variables to the number
of bytes the actual bitpacked fields fit into.

There are a couple of gaps left in the structure, which is a result of the
fields being in the order they appear in the packed bit layout, but where
alignment forces some extra gaps. We could fix this, saving ~8 bytes from
each structure. However, these structures are not used heavily, and the
resulting savings is minimal:

$ bloat-o-meter ice-before-reorder.ko ice-after-reorder.ko
add/remove: 0/0 grow/shrink: 1/1 up/down: 26/-70 (-44)
Function                                     old     new   delta
ice_vsi_cfg_txq                             1873    1899     +26
ice_setup_rx_ctx.constprop                  1529    1459     -70
Total: Before=1459555, After=1459511, chg -0.00%

Thus, the fields are left in the same order as the packed bit layout,
despite the gaps this causes.

Signed-off-by: Jacob Keller <jacob.e.keller@intel.com>
Reviewed-by: Przemek Kitszel <przemyslaw.kitszel@intel.com>
Reviewed-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Link: https://patch.msgid.link/20241210-packing-pack-fields-and-ice-implementation-v10-8-ee56a47479ac@intel.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-12-11 20:13:00 -08:00
Jacob Keller dc4305be46 ice: use <linux/packing.h> for Tx and Rx queue context data
The ice driver needs to write the Tx and Rx queue context when programming
Tx and Rx queues. This is currently done using some bespoke custom logic
via the ice_set_ctx() and its helper functions, along with bit position
definitions in the ice_tlan_ctx_info and ice_rlan_ctx_info structures.

This logic does work, but is problematic for several reasons:

1) ice_set_ctx requires a helper function for each byte size being packed,
   as it uses a separate function to pack u8, u16, u32, and u64 fields.
   This requires 4 functions which contain near-duplicate logic with the
   types changed out.

2) The logic in the ice_pack_ctx_word, ice_pack_ctx_dword, and
   ice_pack_ctx_qword does not handle values which straddle alignment
   boundaries very well. This requires that several fields in the
   ice_tlan_ctx_info and ice_rlan_ctx_info be a size larger than their bit
   size should require.

3) Future support for live migration will require adding unpacking
   functions to take the packed hardware context and unpack it into the
   ice_rlan_ctx and ice_tlan_ctx structures. Implementing this would
   require implementing ice_get_ctx, and its associated helper functions,
   which essentially doubles the amount of code required.

The Linux kernel has had a packing library that can handle this logic since
commit 554aae3500 ("lib: Add support for generic packing operations").
The library was recently extended with support for packing or unpacking an
array of fields, with a similar structure as the ice_ctx_ele structure.

Replace the ice-specific ice_set_ctx() logic with the recently added
pack_fields and packed_field_s infrastructure from <linux/packing.h>

For API simplicity, the Tx and Rx queue context are programmed using
separate ice_pack_txq_ctx() and ice_pack_rxq_ctx(). This avoids needing to
export the packed_field_s arrays. The functions can pointers to the
appropriate ice_txq_ctx_buf_t and ice_rxq_ctx_buf_t types, ensuring that
only buffers of the appropriate size are passed.

Signed-off-by: Jacob Keller <jacob.e.keller@intel.com>
Reviewed-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Link: https://patch.msgid.link/20241210-packing-pack-fields-and-ice-implementation-v10-7-ee56a47479ac@intel.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-12-11 20:13:00 -08:00
Jacob Keller efe39d8b4b ice: use structures to keep track of queue context size
The ice Tx and Rx queue context are currently stored as arrays of bytes
with defined size (ICE_RXQ_CTX_SZ and ICE_TXQ_CTX_SZ). The packed queue
context is often passed to other functions as a simple u8 * pointer, which
does not allow tracking the size. This makes the queue context API easy to
misuse, as you can pass an arbitrary u8 array or pointer.

Introduce wrapper typedefs which use a __packed structure that has the
proper fixed size for the Tx and Rx context buffers. This enables the
compiler to track the size of the value and ensures that passing the wrong
buffer size will be detected by the compiler.

The existing APIs do not benefit much from this change, however the
wrapping structures will be used to simplify the arguments of new packing
functions based on the recently introduced pack_fields API.

Co-developed-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Signed-off-by: Jacob Keller <jacob.e.keller@intel.com>
Reviewed-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Link: https://patch.msgid.link/20241210-packing-pack-fields-and-ice-implementation-v10-6-ee56a47479ac@intel.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-12-11 20:13:00 -08:00
Jacob Keller aeeaa9f891 ice: remove int_q_state from ice_tlan_ctx
The int_q_state field of the ice_tlan_ctx structure represents the internal
queue state. However, we never actually need to assign this or read this
during normal operation. In fact, trying to unpack it would not be possible
as it is larger than a u64. Remove this field from the ice_tlan_ctx
structure, and remove its packing field from the ice_tlan_ctx_info array.

Signed-off-by: Jacob Keller <jacob.e.keller@intel.com>
Reviewed-by: Przemek Kitszel <przemyslaw.kitszel@intel.com>
Reviewed-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Link: https://patch.msgid.link/20241210-packing-pack-fields-and-ice-implementation-v10-5-ee56a47479ac@intel.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-12-11 20:13:00 -08:00
Linus Torvalds 896d8946da Including fixes from can and netfilter.
Current release - regressions:
 
   - rtnetlink: fix double call of rtnl_link_get_net_ifla()
 
   - tcp: populate XPS related fields of timewait sockets
 
   - ethtool: fix access to uninitialized fields in set RXNFC command
 
   - selinux: use sk_to_full_sk() in selinux_ip_output()
 
 Current release - new code bugs:
 
   - net: make napi_hash_lock irq safe
 
   - eth: bnxt_en: support header page pool in queue API
 
   - eth: ice: fix NULL pointer dereference in switchdev
 
 Previous releases - regressions:
 
   - core: fix icmp host relookup triggering ip_rt_bug
 
   - ipv6:
     - avoid possible NULL deref in modify_prefix_route()
     - release expired exception dst cached in socket
 
   - smc: fix LGR and link use-after-free issue
 
   - hsr: avoid potential out-of-bound access in fill_frame_info()
 
   - can: hi311x: fix potential use-after-free
 
   - eth: ice: fix VLAN pruning in switchdev mode
 
 Previous releases - always broken:
 
   - netfilter:
     - ipset: hold module reference while requesting a module
     - nft_inner: incorrect percpu area handling under softirq
 
   - can: j1939: fix skb reference counting
 
   - eth: mlxsw: use correct key block on Spectrum-4
 
   - eth: mlx5: fix memory leak in mlx5hws_definer_calc_layout
 
 Signed-off-by: Paolo Abeni <pabeni@redhat.com>
 -----BEGIN PGP SIGNATURE-----
 
 iQJGBAABCAAwFiEEg1AjqC77wbdLX2LbKSR5jcyPE6QFAmdRve4SHHBhYmVuaUBy
 ZWRoYXQuY29tAAoJECkkeY3MjxOk/TUQAItDVkTQDiAUUqd0TDH2SSnboxXozcMF
 fKVrdWulFAOe7qcajUqjkzvTDjAOw9Xbfh8rEiFBLaXmUzSUT2eqbm2VahvWIR5/
 k08v7fTTuCNzEwhnbQlsZ47Nd26LJVPwOvbtE/4V8d50pU/serjWuI/74tUCWAjn
 DQOjyyqRjKgKKY+WWCQ6g4tVpD//DB4Sj15GiE3MhlW1f08AAnPJSe2oTaq0RZBG
 nXo7abOGn8x3RYilrlp/ZwWYuNpVI4oF+qmp+t/46NV+7ER1JgrC97k0kFyyCYVD
 g7vBvFjvA7vDmiuzfsOW2n7IRdcfBtkfi8UJYOIvVgJg7KDF0DXoxj3qD4FagI35
 RWRMJw+PoNlFFkPprlp0we/19jPJWOO6rx+AOMEQD78jrH7NoFQ/+eeBf/nppfjy
 wX0LD1aQsgPk2ju0I8GbcM/qaJ81EJUiYYVLJHieH0+vvmqts8cMDRLZf5t08EHa
 myXcRGB9N8gjguGp5mdR5KmtY82zASlNC0PbDp3nlPYc3e/opCmMRYGjBohO+vqn
 n7u250WThPwiBtOwYmSbcK7zpS034/VX0ufnTT2X3MWnFGGNDv6XVmho/OBuCHqJ
 m/EiJo/D9qVGAIql/vAxN9a4lQKrcZgaMzCyEPYCf7rtLmzx7sfyHWbf/GZlUAU/
 9dUUfWqCbZcL
 =nkPk
 -----END PGP SIGNATURE-----

Merge tag 'net-6.13-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net

Pull networking fixes from Paolo Abeni:
 "Including fixes from can and netfilter.

  Current release - regressions:

   - rtnetlink: fix double call of rtnl_link_get_net_ifla()

   - tcp: populate XPS related fields of timewait sockets

   - ethtool: fix access to uninitialized fields in set RXNFC command

   - selinux: use sk_to_full_sk() in selinux_ip_output()

  Current release - new code bugs:

   - net: make napi_hash_lock irq safe

   - eth:
      - bnxt_en: support header page pool in queue API
      - ice: fix NULL pointer dereference in switchdev

  Previous releases - regressions:

   - core: fix icmp host relookup triggering ip_rt_bug

   - ipv6:
      - avoid possible NULL deref in modify_prefix_route()
      - release expired exception dst cached in socket

   - smc: fix LGR and link use-after-free issue

   - hsr: avoid potential out-of-bound access in fill_frame_info()

   - can: hi311x: fix potential use-after-free

   - eth: ice: fix VLAN pruning in switchdev mode

  Previous releases - always broken:

   - netfilter:
      - ipset: hold module reference while requesting a module
      - nft_inner: incorrect percpu area handling under softirq

   - can: j1939: fix skb reference counting

   - eth:
      - mlxsw: use correct key block on Spectrum-4
      - mlx5: fix memory leak in mlx5hws_definer_calc_layout"

* tag 'net-6.13-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (76 commits)
  net :mana :Request a V2 response version for MANA_QUERY_GF_STAT
  net: avoid potential UAF in default_operstate()
  vsock/test: verify socket options after setting them
  vsock/test: fix parameter types in SO_VM_SOCKETS_* calls
  vsock/test: fix failures due to wrong SO_RCVLOWAT parameter
  net/mlx5e: Remove workaround to avoid syndrome for internal port
  net/mlx5e: SD, Use correct mdev to build channel param
  net/mlx5: E-Switch, Fix switching to switchdev mode in MPV
  net/mlx5: E-Switch, Fix switching to switchdev mode with IB device disabled
  net/mlx5: HWS: Properly set bwc queue locks lock classes
  net/mlx5: HWS: Fix memory leak in mlx5hws_definer_calc_layout
  bnxt_en: handle tpa_info in queue API implementation
  bnxt_en: refactor bnxt_alloc_rx_rings() to call bnxt_alloc_rx_agg_bmap()
  bnxt_en: refactor tpa_info alloc/free into helpers
  geneve: do not assume mac header is set in geneve_xmit_skb()
  mlxsw: spectrum_acl_flex_keys: Use correct key block on Spectrum-4
  ethtool: Fix wrong mod state in case of verbose and no_mask bitset
  ipmr: tune the ipmr_can_free_table() checks.
  netfilter: nft_set_hash: skip duplicated elements pending gc run
  netfilter: ipset: Hold module reference while requesting a module
  ...
2024-12-05 10:25:06 -08:00
Yuan Can 0566f83d20 igb: Fix potential invalid memory access in igb_init_module()
The pci_register_driver() can fail and when this happened, the dca_notifier
needs to be unregistered, otherwise the dca_notifier can be called when
igb fails to install, resulting to invalid memory access.

Fixes: bbd98fe48a ("igb: Fix DCA errors and do not use context index for 82576")
Signed-off-by: Yuan Can <yuancan@huawei.com>
Tested-by: Pucha Himasekhar Reddy <himasekharx.reddy.pucha@intel.com> (A Contingent worker at Intel)
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
2024-12-03 10:11:53 -08:00
Tore Amundsen f72ce14b23 ixgbe: Correct BASE-BX10 compliance code
SFF-8472 (section 5.4 Transceiver Compliance Codes) defines bit 6 as
BASE-BX10. Bit 6 means a value of 0x40 (decimal 64).

The current value in the source code is 0x64, which appears to be a
mix-up of hex and decimal values. A value of 0x64 (binary 01100100)
incorrectly sets bit 2 (1000BASE-CX) and bit 5 (100BASE-FX) as well.

Fixes: 1b43e0d20f ("ixgbe: Add 1000BASE-BX support")
Signed-off-by: Tore Amundsen <tore@amundsen.org>
Reviewed-by: Paul Menzel <pmenzel@molgen.mpg.de>
Acked-by: Ernesto Castellotti <ernesto@castellotti.net>
Tested-by: Pucha Himasekhar Reddy <himasekharx.reddy.pucha@intel.com> (A Contingent worker at Intel)
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
2024-12-03 10:11:53 -08:00