Commit Graph

61 Commits

Author SHA1 Message Date
Michal Wajdeczko e497957fee drm/xe/pf: Invalidate LMTT during LMEM unprovisioning
Invalidate LMTT immediately after removing VF's LMTT page tables
and clearing root PTE in the LMTT PD to avoid any invalid access
by the hardware (and VF) due to stale data.

Signed-off-by: Michal Wajdeczko <michal.wajdeczko@intel.com>
Cc: Michał Winiarski <michal.winiarski@intel.com>
Cc: Piotr Piórkowski <piotr.piorkowski@intel.com>
Cc: Matthew Brost <matthew.brost@intel.com>
Reviewed-by: Matthew Brost <matthew.brost@intel.com>
Link: https://lore.kernel.org/r/20250711193316.1920-6-michal.wajdeczko@intel.com
2025-07-15 13:05:20 +02:00
Himal Prasad Ghimiray 3ee9f2058a drm/xe/vm: Add a helper xe_vm_range_tilemask_tlb_invalidation()
Introduce xe_vm_range_tilemask_tlb_invalidation(), which issues a TLB
invalidation for a specified address range across GTs indicated by a
tilemask.

v2 (Matthew Brost)
- Move WARN_ON_ONCE to svm caller
- Remove xe_gt_tlb_invalidation_vma
- s/XE_WARN_ON/WARN_ON_ONCE

v3
- Rebase

Suggested-by: Matthew Brost <matthew.brost@intel.com>
Reviewed-by: Matthew Brost <matthew.brost@intel.com>
Link: https://lore.kernel.org/r/20250609041616.1723636-1-himal.prasad.ghimiray@intel.com
Signed-off-by: Himal Prasad Ghimiray <himal.prasad.ghimiray@intel.com>
2025-06-13 12:51:43 +05:30
Daniele Ceraolo Spurio 0b93b7dcd9 drm/xe: Fix early wedge on GuC load failure
When the GuC fails to load we declare the device wedged. However, the
very first GuC load attempt on GT0 (from xe_gt_init_hwconfig) is done
before the GT1 GuC objects are initialized, so things go bad when the
wedge code attempts to cleanup GT1. To fix this, check the initialization
status in the functions called during wedge.

Fixes: 7dbe8af13c ("drm/xe: Wedge the entire device")
Signed-off-by: Daniele Ceraolo Spurio <daniele.ceraolospurio@intel.com>
Cc: Rodrigo Vivi <rodrigo.vivi@intel.com>
Cc: Matthew Brost <matthew.brost@intel.com>
Cc: Jonathan Cavitt <jonathan.cavitt@intel.com>
Cc: Lucas De Marchi <lucas.demarchi@intel.com>
Cc: Zhanjun Dong <zhanjun.dong@intel.com>
Cc: stable@vger.kernel.org # v6.12+: 1e1981b16bb1: drm/xe: Fix taking invalid lock on wedge
Cc: stable@vger.kernel.org # v6.12+
Reviewed-by: Jonathan Cavitt <jonathan.cavitt@intel.com>
Reviewed-by: Lucas De Marchi <lucas.demarchi@intel.com>
Link: https://lore.kernel.org/r/20250611214453.1159846-2-daniele.ceraolospurio@intel.com
Signed-off-by: Lucas De Marchi <lucas.demarchi@intel.com>
2025-06-12 15:17:12 -07:00
Thomas Hellström b88f48f865 drm/xe: Fix an out-of-bounds shift when invalidating TLB
When the size of the range invalidated is larger than
rounddown_pow_of_two(ULONG_MAX),
The function macro roundup_pow_of_two(length) will hit an out-of-bounds
shift [1].

Use a full TLB invalidation for such cases.
v2:
- Use a define for the range size limit over which we use a full
  TLB invalidation. (Lucas)
- Use a better calculation of the limit.

[1]:
[   39.202421] ------------[ cut here ]------------
[   39.202657] UBSAN: shift-out-of-bounds in ./include/linux/log2.h:57:13
[   39.202673] shift exponent 64 is too large for 64-bit type 'long unsigned int'
[   39.202688] CPU: 8 UID: 0 PID: 3129 Comm: xe_exec_system_ Tainted: G     U             6.14.0+ #10
[   39.202690] Tainted: [U]=USER
[   39.202690] Hardware name: ASUS System Product Name/PRIME B560M-A AC, BIOS 2001 02/01/2023
[   39.202691] Call Trace:
[   39.202692]  <TASK>
[   39.202695]  dump_stack_lvl+0x6e/0xa0
[   39.202699]  ubsan_epilogue+0x5/0x30
[   39.202701]  __ubsan_handle_shift_out_of_bounds.cold+0x61/0xe6
[   39.202705]  xe_gt_tlb_invalidation_range.cold+0x1d/0x3a [xe]
[   39.202800]  ? find_held_lock+0x2b/0x80
[   39.202803]  ? mark_held_locks+0x40/0x70
[   39.202806]  xe_svm_invalidate+0x459/0x700 [xe]
[   39.202897]  drm_gpusvm_notifier_invalidate+0x4d/0x70 [drm_gpusvm]
[   39.202900]  __mmu_notifier_release+0x1f5/0x270
[   39.202905]  exit_mmap+0x40e/0x450
[   39.202912]  __mmput+0x45/0x110
[   39.202914]  exit_mm+0xc5/0x130
[   39.202916]  do_exit+0x21c/0x500
[   39.202918]  ? lockdep_hardirqs_on_prepare+0xdb/0x190
[   39.202920]  do_group_exit+0x36/0xa0
[   39.202922]  get_signal+0x8f8/0x900
[   39.202926]  arch_do_signal_or_restart+0x35/0x100
[   39.202930]  syscall_exit_to_user_mode+0x1fc/0x290
[   39.202932]  do_syscall_64+0xa1/0x180
[   39.202934]  ? do_user_addr_fault+0x59f/0x8a0
[   39.202937]  ? lock_release+0xd2/0x2a0
[   39.202939]  ? do_user_addr_fault+0x5a9/0x8a0
[   39.202942]  ? trace_hardirqs_off+0x4b/0xc0
[   39.202944]  ? clear_bhb_loop+0x25/0x80
[   39.202946]  ? clear_bhb_loop+0x25/0x80
[   39.202947]  ? clear_bhb_loop+0x25/0x80
[   39.202950]  entry_SYSCALL_64_after_hwframe+0x76/0x7e
[   39.202952] RIP: 0033:0x7fa945e543e1
[   39.202961] Code: Unable to access opcode bytes at 0x7fa945e543b7.
[   39.202962] RSP: 002b:00007ffca8fb4170 EFLAGS: 00000293
[   39.202963] RAX: 000000000000003d RBX: 0000000000000000 RCX: 00007fa945e543e3
[   39.202964] RDX: 0000000000000000 RSI: 00007ffca8fb41ac RDI: 00000000ffffffff
[   39.202964] RBP: 00007ffca8fb4190 R08: 0000000000000000 R09: 00007fa945f600a0
[   39.202965] R10: 0000000000000000 R11: 0000000000000293 R12: 0000000000000000
[   39.202966] R13: 00007fa9460dd310 R14: 00007ffca8fb41ac R15: 0000000000000000
[   39.202970]  </TASK>
[   39.202970] ---[ end trace ]---

Fixes: 332dd0116c ("drm/xe: Add range based TLB invalidations")
Cc: Matthew Brost <matthew.brost@intel.com>
Cc: Rodrigo Vivi <rodrigo.vivi@intel.com>
Cc: <stable@vger.kernel.org> # v6.8+
Signed-off-by: Thomas Hellström <thomas.hellstrom@linux.intel.com>
Reviewed-by: Lucas De Marchi <lucas.demarchi@intel.com> #v1
Link: https://lore.kernel.org/r/20250326151634.36916-1-thomas.hellstrom@linux.intel.com
2025-03-27 16:06:36 +01:00
Matthew Brost 074e40d9c2 drm/xe: Nuke VM's mapping upon close
Clear root PT entry and invalidate entire VM's address space when
closing the VM. Will prevent the GPU from accessing any of the VM's
memory after closing.

v2:
 - s/vma/vm in kernel doc (CI)
 - Don't nuke migration VM as this occur at driver unload (CI)
v3:
 - Rebase and pull into SVM series (Thomas)
 - Wait for pending binds (Thomas)
v5:
 - Remove xe_gt_tlb_invalidation_fence_fini in error case (Matt Auld)
 - Drop local migration bool (Thomas)
v7:
 - Add drm_dev_enter/exit protecting invalidation (CI, Matt Auld)

Signed-off-by: Matthew Brost <matthew.brost@intel.com>
Reviewed-by: Thomas Hellström <thomas.hellstrom@linux.intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20250306012657.3505757-12-matthew.brost@intel.com
2025-03-06 11:35:38 -08:00
Lucas De Marchi 6acea03f98 drm/xe: Remove "graphics tile" from kernel doc
Avoid using "graphics tile" to refer to GT since it's ambiguous: it's
**part** of a tile and there's also "media gt". In several places it's
documented as "GT structure", so just follow it.

Reviewed-by: Matthew Brost <matthew.brost@intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20250103001111.331684-3-lucas.demarchi@intel.com
Signed-off-by: Lucas De Marchi <lucas.demarchi@intel.com>
2025-01-03 12:43:02 -08:00
Lucas De Marchi 5001ef3af8 drm/xe: Fix tlb invalidation when wedging
If GuC fails to load, the driver wedges, but in the process it tries to
do stuff that may not be initialized yet. This moves the
xe_gt_tlb_invalidation_init() to be done earlier: as its own doc says,
it's a software-only initialization and should had been named with the
_early() suffix.

Move it to be called by xe_gt_init_early(), so the locks and seqno are
initialized, avoiding a NULL ptr deref when wedging:

	xe 0000:03:00.0: [drm] *ERROR* GT0: load failed: status: Reset = 0, BootROM = 0x50, UKernel = 0x00, MIA = 0x00, Auth = 0x01
	xe 0000:03:00.0: [drm] *ERROR* GT0: firmware signature verification failed
	xe 0000:03:00.0: [drm] *ERROR* CRITICAL: Xe has declared device 0000:03:00.0 as wedged.
	...
	BUG: kernel NULL pointer dereference, address: 0000000000000000
	#PF: supervisor read access in kernel mode
	#PF: error_code(0x0000) - not-present page
	PGD 0 P4D 0
	Oops: Oops: 0000 [#1] PREEMPT SMP NOPTI
	CPU: 9 UID: 0 PID: 3908 Comm: modprobe Tainted: G     U  W          6.13.0-rc4-xe+ #3
	Tainted: [U]=USER, [W]=WARN
	Hardware name: Intel Corporation Alder Lake Client Platform/AlderLake-S ADP-S DDR5 UDIMM CRB, BIOS ADLSFWI1.R00.3275.A00.2207010640 07/01/2022
	RIP: 0010:xe_gt_tlb_invalidation_reset+0x75/0x110 [xe]

This can be easily triggered by poking the GuC binary to force a
signature failure. There will still be an extra message,

	xe 0000:03:00.0: [drm] *ERROR* GT0: GuC mmio request 0x4100: no reply 0x4100

but that's better than a NULL ptr deref.

Closes: https://gitlab.freedesktop.org/drm/xe/kernel/-/issues/3956
Fixes: 7dbe8af13c ("drm/xe: Wedge the entire device")
Reviewed-by: Matthew Brost <matthew.brost@intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20250103001111.331684-2-lucas.demarchi@intel.com
Signed-off-by: Lucas De Marchi <lucas.demarchi@intel.com>
2025-01-03 12:43:01 -08:00
Daniele Ceraolo Spurio 65338639b7 drm/xe: Call invalidation_fence_fini for PT inval fences in error state
Invalidation_fence_init takes a PM reference, which is released in its
_fini counterpart, so we need to make sure that the latter is called,
even if the fence is in an error state.

Since we already have a function that calls _fini() and signals the
fence in the tlb inval code, we can expose that and call it from the PT
code.

Fixes: f002702290 ("drm/xe: Hold a PM ref when GT TLB invalidations are inflight")
Signed-off-by: Daniele Ceraolo Spurio <daniele.ceraolospurio@intel.com>
Cc: <stable@vger.kernel.org> # v6.11+
Cc: Matthew Brost <matthew.brost@intel.com>
Cc: Nirmoy Das <nirmoy.das@intel.com>
Cc: Rodrigo Vivi <rodrigo.vivi@intel.com>
Reviewed-by: Nirmoy Das <nirmoy.das@intel.com>
Reviewed-by: Matthew Brost <matthew.brost@intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20241206015022.1567113-1-daniele.ceraolospurio@intel.com
2024-12-10 13:11:14 -08:00
Nirmoy Das 68634b12d7 drm/xe: Ignore GGTT TLB inval errors during GT reset
During GT reset, GGTT TLB invalidations may fail. This is acceptable
as the reset will clear GGTT caches. Suppress only -ECANCELED other
return codes are still unexpected error.

v2: Add code comment(Matt).

Link: https://gitlab.freedesktop.org/drm/xe/kernel/-/issues/3389
Suggested-by: Matthew Brost <matthew.brost@intel.com>
Reviewed-by: Matthew Brost <matthew.brost@intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20241112170123.1236443-1-nirmoy.das@intel.com
Signed-off-by: Nirmoy Das <nirmoy.das@intel.com>
2024-11-14 12:20:23 +01:00
Nirmoy Das e1f6fa5566 drm/xe/guc/tlb: Flush g2h worker in case of tlb timeout
Flush the g2h worker explicitly if TLB timeout happens which is
observed on LNL and that points to the recent scheduling issue with
E-cores on LNL.

This is similar to the recent fix:
commit e515272338 ("drm/xe/guc/ct: Flush g2h worker in case of g2h
response timeout") and should be removed once there is E core
scheduling fix.

v2: Add platform check(Himal)
v3: Remove gfx platform check as the issue related to cpu
    platform(John)
    Use the common WA macro(John) and print when the flush
    resolves timeout(Matt B)
v4: Remove the resolves log and do the flush before taking
    pending_lock(Matt A)

Cc: Badal Nilawar <badal.nilawar@intel.com>
Cc: Matthew Brost <matthew.brost@intel.com>
Cc: Matthew Auld <matthew.auld@intel.com>
Cc: John Harrison <John.C.Harrison@Intel.com>
Cc: Himal Prasad Ghimiray <himal.prasad.ghimiray@intel.com>
Cc: Lucas De Marchi <lucas.demarchi@intel.com>
Cc: stable@vger.kernel.org # v6.11+
Link: https://gitlab.freedesktop.org/drm/xe/kernel/-/issues/2687
Signed-off-by: Nirmoy Das <nirmoy.das@intel.com>
Reviewed-by: Matthew Auld <matthew.auld@intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20241029120117.449694-3-nirmoy.das@intel.com
Signed-off-by: Lucas De Marchi <lucas.demarchi@intel.com>
2024-11-01 06:23:26 -07:00
Himal Prasad Ghimiray 1d5bf4fd1b
drm/xe/gt_tlb_invalidation_ggtt: Update handling of xe_force_wake_get return
xe_force_wake_get() now returns the reference count-incremented domain
mask. If it fails for individual domains, the return value will always
be 0. However, for XE_FORCEWAKE_ALL, it may return a non-zero value even
in the event of failure. Update the return handling of xe_force_wake_get()
to reflect this behavior, and ensure that the return value is passed as
input to xe_force_wake_put().

v3
- return xe_wakeref_t instead of int in xe_force_wake_get()

v5
- return unsigned int from xe_force_wake_get()
- remove redundant warns

v7
- Fix commit message

Cc: Matthew Brost <matthew.brost@intel.com>
Cc: Rodrigo Vivi <rodrigo.vivi@intel.com>
Cc: Lucas De Marchi <lucas.demarchi@intel.com>
Signed-off-by: Himal Prasad Ghimiray <himal.prasad.ghimiray@intel.com>
Reviewed-by: Nirmoy Das <nirmoy.das@intel.com>
Reviewed-by: Badal Nilawar <badal.nilawar@intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20241014075601.2324382-21-himal.prasad.ghimiray@intel.com
Signed-off-by: Rodrigo Vivi <rodrigo.vivi@intel.com>
2024-10-17 10:17:09 -04:00
Matthew Auld cfcbc0520d drm/xe: fix unbalanced rpm put() with fence_fini()
Currently we can call fence_fini() twice if something goes wrong when
sending the GuC CT for the tlb request, since we signal the fence and
return an error, leading to the caller also calling fini() on the error
path in the case of stack version of the flow, which leads to an extra
rpm put() which might later cause device to enter suspend when it
shouldn't. It looks like we can just drop the fini() call since the
fence signaller side will already call this for us.

There are known mysterious splats with device going to sleep even with
an rpm ref, and this could be one candidate.

v2 (Matt B):
  - Prefer warning if we detect double fini()

Fixes: 0a382f9bc5 ("drm/xe: Hold a PM ref when GT TLB invalidations are inflight")
Signed-off-by: Matthew Auld <matthew.auld@intel.com>
Cc: Matthew Brost <matthew.brost@intel.com>
Cc: Nirmoy Das <nirmoy.das@intel.com>
Reviewed-by: Matthew Brost <matthew.brost@intel.com>
Reviewed-by: Nirmoy Das <nirmoy.das@intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20241009084808.204432-3-matthew.auld@intel.com
2024-10-10 09:15:58 +01:00
Matt Roper 5fd12cc444 drm/xe/tlb: Convert register access to use xe_mmio
Stop using GT pointers for register access.

Reviewed-by: Rodrigo Vivi <rodrigo.vivi@intel.com>
Signed-off-by: Matt Roper <matthew.d.roper@intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20240910234719.3335472-83-matthew.d.roper@intel.com
2024-09-11 15:32:51 -07:00
Nirmoy Das 39fa14e5bd drm/xe: Add stats for tlb invalidation count
Add stats for tlb invalidation count which can be viewed with per GT
stat debugfs file.

Example output:
cat /sys/kernel/debug/dri/0/gt0/stats
tlb_inval_count: 22

v2: fix #include order(Tejas)

Cc: Lucas De Marchi <lucas.demarchi@intel.com>
Cc: Matthew Brost <matthew.brost@intel.com>
Cc: Michal Wajdeczko <michal.wajdeczko@intel.com>
Cc: Sai Gowtham Ch <sai.gowtham.ch@intel.com>
Reviewed-by: Tejas Upadhyay <tejas.upadhyay@intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20240810191522.18616-2-nirmoy.das@intel.com
Signed-off-by: Nirmoy Das <nirmoy.das@intel.com>
2024-08-12 19:08:53 +02:00
Matthew Brost 6482253e6e drm/xe: Remove fence check from send_tlb_invalidation
'fence' argument in send_tlb_invalidation cannot be NULL, remove
non-NULL check from send_tlb_invalidation.

Reported-by: kernel test robot <lkp@intel.com>
Reported-by: Dan Carpenter <dan.carpenter@linaro.org>
Closes: https://lore.kernel.org/r/202407231049.esig0Fkb-lkp@intel.com/
Fixes: 61ac035361 ("drm/xe: Drop xe_gt_tlb_invalidation_wait")
Signed-off-by: Matthew Brost <matthew.brost@intel.com>
Reviewed-by: Nirmoy Das <nirmoy.das@intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20240723190714.1744653-1-matthew.brost@intel.com
Signed-off-by: Nirmoy Das <nirmoy.das@intel.com>
2024-07-24 11:14:53 +02:00
Matthew Brost 0a382f9bc5 drm/xe: Hold a PM ref when GT TLB invalidations are inflight
Avoid GT TLB invalidation timeouts by holding a PM ref when
invalidations are inflight.

v2:
 - Drop PM ref before signaling fence (CI)
v3:
 - Move invalidation_fence_signal helper in tlb timeout to previous
   patch (Matthew Auld)

Fixes: dd08ebf6c3 ("drm/xe: Introduce a new DRM driver for Intel GPUs")
Cc: Rodrigo Vivi <rodrigo.vivi@intel.com>
Cc: Nirmoy Das <nirmoy.das@intel.com>
Signed-off-by: Matthew Brost <matthew.brost@intel.com>
Reviewed-by: Nirmoy Das <nirmoy.das@intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20240719172905.1527927-4-matthew.brost@intel.com
2024-07-19 19:45:32 -07:00
Matthew Brost 61ac035361 drm/xe: Drop xe_gt_tlb_invalidation_wait
Having two methods to wait on GT TLB invalidations is not ideal. Remove
xe_gt_tlb_invalidation_wait and only use GT TLB invalidation fences.

In addition to two methods being less than ideal, once GT TLB
invalidations are coalesced the seqno cannot be assigned during
xe_gt_tlb_invalidation_ggtt/range. Thus xe_gt_tlb_invalidation_wait
would not have a seqno to wait one. A fence however can be armed and
later signaled.

v3:
 - Add explaination about coalescing to commit message
v4:
 - Don't put dma fence if defined on stack (CI)
v5:
 - Initialize ret to zero (CI)
v6:
 - Use invalidation_fence_signal helper in tlb timeout (Matthew Auld)

Signed-off-by: Matthew Brost <matthew.brost@intel.com>
Reviewed-by: Nirmoy Das <nirmoy.das@intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20240719172905.1527927-3-matthew.brost@intel.com
2024-07-19 19:45:31 -07:00
Matthew Brost a522b285c6 drm/xe: Add xe_gt_tlb_invalidation_fence_init helper
Other layers should not be touching struct xe_gt_tlb_invalidation_fence
directly, add helper for initialization.

v2:
 - Add dma_fence_get and list init to xe_gt_tlb_invalidation_fence_init

Signed-off-by: Matthew Brost <matthew.brost@intel.com>
Reviewed-by: Nirmoy Das <nirmoy.das@intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20240719172905.1527927-2-matthew.brost@intel.com
2024-07-19 19:45:29 -07:00
Nirmoy Das eb523ec382 drm/xe/guc: Configure TLB timeout based on CT buffer size
GuC TLB invalidation depends on GuC to process the request from the CT
queue and then the real time to invalidate TLB. Add a function to return
overestimated possible time a TLB inval H2G might take which can be used
as timeout value for TLB invalidation wait time.

v4: Make sure CTB is in 4K blocks(Michal) and other doc fixes
v3: Pass CT to xe_guc_ct_queue_proc_time_jiffies() (Michal)
    Add tlb_timeout_jiffies() that replaces TLB_TIMEOUT(Michal)
v2: Address reviews from Michal.

Closes: https://gitlab.freedesktop.org/drm/xe/kernel/-/issues/1622
Cc: Matthew Brost <matthew.brost@intel.com>
Cc: Michal Wajdeczko <michal.wajdeczko@intel.com>
Suggested-by: Daniele Ceraolo Spurio <daniele.ceraolospurio@intel.com>
Acked-by: Matthew Brost <matthew.brost@intel.com>
Reviewed-by: Michal Wajdeczko <michal.wajdeczko@intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20240628085845.2369-1-nirmoy.das@intel.com
Signed-off-by: Nirmoy Das <nirmoy.das@intel.com>
2024-07-01 17:38:48 +02:00
Michal Wajdeczko ef3fcfe063 drm/xe/vf: Don't use register based TLB invalidation if VF
VF drivers can only use GuC-based TLB invalidation, as they don't
have access to the related registers. However, VFs shouldn't need
any explicit TLB invalidation before enabling CTB communication,
as there will be an implicit GGTT TLB invalidation issued by the
GuC itself as part of MMIO-based action handling.

Signed-off-by: Michal Wajdeczko <michal.wajdeczko@intel.com>
Reviewed-by: Piotr Piórkowski <piotr.piorkowski@intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20240619214557.905-8-michal.wajdeczko@intel.com
2024-06-20 19:49:41 +02:00
Radhakrishna Sripada 501c4255c4 drm/xe/trace: Print device_id in xe_trace events
In multi-gpu environments it is important to know the device
gt events belongs to. The tracing information includes the device_id
to indicate the device the event is associated with.

v2: Use variable sized variant to display dev name(Gustavo)
v3: Pass single argument to __assign_str to fix kunit error
v4: Remove unused sting_helper library include

Suggested-by: Ville Syrjälä <ville.syrjala@linux.intel.com>
Cc: Lucas De Marchi <lucas.demarchi@intel.com>
Reviewed-by: Gustavo Sousa <gustavo.sousa@intel.com>
Signed-off-by: Radhakrishna Sripada <radhakrishna.sripada@intel.com>
Signed-off-by: Matt Roper <matthew.d.roper@intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20240607182943.3572524-6-radhakrishna.sripada@intel.com
2024-06-12 09:25:13 -07:00
Michal Wajdeczko 93dd6ad89c drm/xe: Don't rely on xe_force_wake.h to be included elsewhere
While xe_force_wake.h is now included from the xe_device.h, we
want to drop that include as we don't need it there. Explicitly
include xe_force_wake.h where needed.

Signed-off-by: Michal Wajdeczko <michal.wajdeczko@intel.com>
Reviewed-by: Rodrigo Vivi <rodrigo.vivi@intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20240507110959.2747-3-michal.wajdeczko@intel.com
2024-05-07 23:21:17 +02:00
Nirmoy Das e29a7a34c3 drm/xe: Remove uninitialized end var from xe_gt_tlb_invalidation_range()
This fixes commit c4f1870362 ("drm/xe: Add
xe_gt_tlb_invalidation_range and convert PT layer to use this")
which added the end variable as part of the function param.

v2: Add fixes tag(Matt)

Fixes: c4f1870362 ("drm/xe: Add xe_gt_tlb_invalidation_range and convert PT layer to use this")
Cc: Matthew Brost <matthew.brost@intel.com>
Signed-off-by: Nirmoy Das <nirmoy.das@intel.com>
Reviewed-by: Matthew Brost <matthew.brost@intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20240429203039.26918-1-nirmoy.das@intel.com
Signed-off-by: Lucas De Marchi <lucas.demarchi@intel.com>
2024-04-30 06:13:33 -07:00
Matthew Brost c4f1870362 drm/xe: Add xe_gt_tlb_invalidation_range and convert PT layer to use this
xe_gt_tlb_invalidation_range accepts a start and end address rather than
a VMA. This will enable multiple VMAs to be invalidated in a single
invalidation. Update the PT layer to use this new function.

Signed-off-by: Matthew Brost <matthew.brost@intel.com>
Reviewed-by: Oak Zeng <oak.zeng@intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20240425045513.1913039-13-matthew.brost@intel.com
2024-04-26 12:10:07 -07:00
Rodrigo Vivi 8ed9aaae39
drm/xe: Force wedged state and block GT reset upon any GPU hang
In many validation situations when debugging GPU Hangs,
it is useful to preserve the GT situation from the moment
that the timeout occurred.

This patch introduces a module parameter that could be used
on situations like this.

If xe.wedged module parameter is set to 2, Xe will be declared
wedged on every single execution timeout (a.k.a. GPU hang) right
after devcoredump snapshot capture and without attempting any
kind of GT reset and blocking entirely any kind of execution.

v2: Really block gt_reset from guc side. (Lucas)
    s/wedged/busted (Lucas)

v3: - s/busted/wedged
    - Really use global_flags (Dafna)
    - More robust timeout handling when wedging it.

v4: A really robust clean exit done by Matt Brost.
    No more kernel warns on unbind.

v5: Simplify error message (Lucas)

Cc: Matthew Brost <matthew.brost@intel.com>
Cc: Dafna Hirschfeld <dhirschfeld@habana.ai>
Cc: Lucas De Marchi <lucas.demarchi@intel.com>
Cc: Alan Previn <alan.previn.teres.alexis@intel.com>
Cc: Himanshu Somaiya <himanshu.somaiya@intel.com>
Reviewed-by: Lucas De Marchi <lucas.demarchi@intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20240423221817.1285081-3-rodrigo.vivi@intel.com
Signed-off-by: Rodrigo Vivi <rodrigo.vivi@intel.com>
2024-04-24 12:12:58 -04:00
Thomas Hellström 0453f17575 drm/xe: Make TLB invalidation fences unordered
They can actually complete out-of-order, so allocate a unique
fence context for each fence.

Fixes: 5387e865d9 ("drm/xe: Add TLB invalidation fence after rebinds issued from execs")
Cc: Matthew Brost <matthew.brost@intel.com>
Cc: <stable@vger.kernel.org> # v6.8+
Signed-off-by: Thomas Hellström <thomas.hellstrom@linux.intel.com>
Reviewed-by: Matthew Brost <matthew.brost@intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20240327091136.3271-4-thomas.hellstrom@linux.intel.com
2024-03-28 08:39:29 +01:00
Daniele Ceraolo Spurio 649a125a88 drm/xe: Always check force_wake_get return code
A force_wake_get failure means that the HW might not be awake for the
access we're doing; this can lead to an immediate error or it can be a
more subtle problem (e.g. a register read might return an incorrect
value that is still valid, leading the driver to make a wrong choice
instead of flagging an error).
We avoid an error from the force_wake function because callers might
handle or tolerate the error, but this only works if all callers
are checking the error code. The majority already do, but a few are not.
These are mainly falling into 3 categories, which are each handled
differently:

1) error capture: in this case we want to continue the capture, but we
   log an info message in dmesg to notify the user that the capture
   might have incorrect data.

2) ioctl: in this case we return a -EIO error to userspace

3) unabortable actions: these are scenarios where we can't simply abort
   and retry and so it's better to just try it anyway because there is a
   chance the HW is awake even with the failure. In this case we throw a
   warning so we know there was a forcewake problem if something fails
   down the line.

v2: use gt_WARN_ON where appropriate

Signed-off-by: Daniele Ceraolo Spurio <daniele.ceraolospurio@intel.com>
Cc: Tejas Upadhyay <tejas.upadhyay@intel.com>
Reviewed-by: Matt Roper <matthew.d.roper@intel.com>
Reviewed-by: Tejas Upadhyay <tejas.upadhyay@intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20240318154924.3453513-1-daniele.ceraolospurio@intel.com
2024-03-20 14:13:58 -07:00
Matthew Brost 27ee413bbc drm/xe: Do not grab forcewakes when issuing GGTT TLB invalidation via GuC
Forcewakes are not required for communication with the GuC via CTB
as it is a memory based interfaced. Acquring forcewakes takes
considerable time. With that, do not grab a forcewake when issuing a
GGTT TLB invalidation via the GuC.

Cc: Maarten Lankhorst <maarten.lankhorst@linux.intel.com>
Cc: Lucas De Marchi <lucas.demarchi@intel.com>
Cc: Matt Roper <matthew.d.roper@intel.com>
Signed-off-by: Matthew Brost <matthew.brost@intel.com>
Reviewed-by: Maarten Lankhorst <maarten.lankhorst@linux.intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20240229194520.200642-1-matthew.brost@intel.com
2024-03-05 16:48:55 -08:00
Matthew Brost a9e483dda3 drm/xe: Don't support execlists in xe_gt_tlb_invalidation layer
The xe_gt_tlb_invalidation layer implements TLB invalidations for a GuC
backend. Simply return if in execlists mode. A follow up may properly
implement the xe_gt_tlb_invalidation layer for both GuC and execlists.

Fixes: a9351846d9 ("drm/xe: Break of TLB invalidation into its own file")
Cc: Rodrigo Vivi <rodrigo.vivi@intel.com>
Signed-off-by: Matthew Brost <matthew.brost@intel.com>
Reviewed-by: Rodrigo Vivi <rodrigo.vivi@intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20240222232021.3911545-4-matthew.brost@intel.com
2024-02-23 11:44:59 -08:00
Matthew Brost 3121fed0c5 drm/xe: Cleanup some layering in GGTT
xe_ggtt.c touched GuC layers which is incorrect. Call into
xe_gt_tlb_invalidation layer instead.

Cc: Rodrigo Vivi <rodrigo.vivi@intel.com>
Signed-off-by: Matthew Brost <matthew.brost@intel.com>
Reviewed-by: Rodrigo Vivi <rodrigo.vivi@intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20240222232021.3911545-3-matthew.brost@intel.com
2024-02-23 11:44:57 -08:00
Michal Wajdeczko e8b9b3097c drm/xe: Report TLB timeout using GT oriented functions
We track TLB invalidation seqno per GT and we have GT oriented
message helpers, so it's better to use GT oriented log functions.

Cc: Matt Roper <matthew.d.roper@intel.com>
Cc: Matthew Brost <matthew.brost@intel.com>
Reviewed-by: Lucas De Marchi <lucas.demarchi@intel.com>
Link: https://lore.kernel.org/r/20231218190629.502-3-michal.wajdeczko@intel.com
Signed-off-by: Michal Wajdeczko <michal.wajdeczko@intel.com>
2023-12-21 16:31:29 -05:00
Michal Wajdeczko b67cb798e4 drm/xe/guc: Include only required GuC ABI headers
On i915 we were adding new GuC ABI headers directly to guc_fwif.h
file since we were replacing old definitions from that file.

On xe driver we could do more and better by including ABI headers
only in files that need those definitions.

Link: https://gitlab.freedesktop.org/drm/xe/kernel/-/issues/741
Cc: Jani Nikula <jani.nikula@intel.com>
Acked-by: Jani Nikula <jani.nikula@intel.com>
Reviewed-by: Matthew Brost <matthew.brost@intel.com>
Link: https://lore.kernel.org/r/20231128203203.1147-3-michal.wajdeczko@intel.com
Signed-off-by: Michal Wajdeczko <michal.wajdeczko@intel.com>
Signed-off-by: Rodrigo Vivi <rodrigo.vivi@intel.com>
2023-12-21 11:45:08 -05:00
Michal Wajdeczko 3d78923bd0 drm/xe/guc: Promote guc_to_gt/xe helpers to .h
Duplicating these helpers in almost every .c file is a bad idea.
Define them as inlines in .h file to allow proper reuse.

Signed-off-by: Michal Wajdeczko <michal.wajdeczko@intel.com>
Cc: Rodrigo Vivi <rodrigo.vivi@intel.com>
Cc: Jani Nikula <jani.nikula@intel.com>
Reviewed-by: Matthew Brost <matthew.brost@intel.com>
Signed-off-by: Rodrigo Vivi <rodrigo.vivi@intel.com>
2023-12-21 11:44:32 -05:00
Pallavi Mishra ebb00b285b drm/xe: Dump CTB during TLB timeout
Print CTB info during TLB invalidation
timeout event.

Reviewed-by: Matthew Brost <matthew.brost@intel.com>
Signed-off-by: Pallavi Mishra <pallavi.mishra@intel.com>
Signed-off-by: Rodrigo Vivi <rodrigo.vivi@intel.com>
2023-12-21 11:43:32 -05:00
Francois Dugast c73acc1eeb drm/xe: Use Xe assert macros instead of XE_WARN_ON macro
The XE_WARN_ON macro maps to WARN_ON which is not justified
in many cases where only a simple debug check is needed.
Replace the use of the XE_WARN_ON macro with the new xe_assert
macros which relies on drm_*. This takes a struct drm_device
argument, which is one of the main changes in this commit. The
other main change is that the condition is reversed, as with
XE_WARN_ON a message is displayed if the condition is true,
whereas with xe_assert it is if the condition is false.

v2:
- Rebase
- Keep WARN splats in xe_wopcm.c (Matt Roper)

v3:
- Rebase

Signed-off-by: Francois Dugast <francois.dugast@intel.com>
Reviewed-by: Rodrigo Vivi <rodrigo.vivi@intel.com>
Signed-off-by: Rodrigo Vivi <rodrigo.vivi@intel.com>
2023-12-21 11:41:08 -05:00
Francois Dugast 99fea68288 drm/xe: Prefer WARN() over BUG() to avoid crashing the kernel
Replace calls to XE_BUG_ON() with calls XE_WARN_ON() which in turn calls
WARN() instead of BUG(). BUG() crashes the kernel and should only be
used when it is absolutely unavoidable in case of catastrophic and
unrecoverable failures, which is not the case here.

Signed-off-by: Francois Dugast <francois.dugast@intel.com>
Reviewed-by: Rodrigo Vivi <rodrigo.vivi@intel.com>
Signed-off-by: Rodrigo Vivi <rodrigo.vivi@intel.com>
2023-12-21 11:39:17 -05:00
Matthew Auld 7d623575a3 drm/xe: drop xe_device_mem_access_get() from invalidation_vma
Lockdep gives the following splat:

[  594.158863] ffff888140da53f0 (&vm->userptr.notifier_lock){++++}-{3:3}, at: vma_userptr_invalidate+0xeb/0x330 [xe]
[  594.158921]
               but task is already holding lock:
[  594.158926] ffffffff82761940
(mmu_notifier_invalidate_range_start){+.+.}-{0:0}, at: unmap_vmas+0x0/0x1c0
[  594.158941]
               which lock already depends on the new lock.

[  594.158947]
               the existing dependency chain (in reverse order) is:
[  594.158953]
               -> #5 (mmu_notifier_invalidate_range_start){+.+.}-{0:0}:
[  594.158961]        fs_reclaim_acquire+0x68/0xd0
[  594.158969]        __kmem_cache_alloc_node+0x2c/0x1b0
[  594.158975]        kmalloc_node_trace+0x1d/0xb0
[  594.158983]        alloc_worker+0x18/0x50
[  594.158989]        init_rescuer.part.0+0x13/0xa0
[  594.158995]        workqueue_init+0xdf/0x210
[  594.159001]        kernel_init_freeable+0x5c/0x2f0
[  594.159009]        kernel_init+0x11/0x1a0
[  594.159017]        ret_from_fork+0x29/0x50
[  594.159023]
               -> #4 (fs_reclaim){+.+.}-{0:0}:
[  594.159031]        fs_reclaim_acquire+0xa0/0xd0
[  594.159037]        __kmem_cache_alloc_node+0x2c/0x1b0
[  594.159042]        kmalloc_trace+0x20/0xb0
[  594.159048]        acpi_device_add+0x25a/0x3f0
[  594.159056]        acpi_add_single_object+0x387/0x750
[  594.159063]        acpi_bus_check_add+0x108/0x280
[  594.159069]        acpi_bus_scan+0x34/0xf0
[  594.159075]        acpi_scan_init+0xed/0x2b0
[  594.159082]        acpi_init+0x21e/0x520
[  594.159087]        do_one_initcall+0x53/0x260
[  594.159092]        kernel_init_freeable+0x18a/0x2f0
[  594.159099]        kernel_init+0x11/0x1a0
[  594.159105]        ret_from_fork+0x29/0x50
[  594.159110]
               -> #3 (acpi_device_lock){+.+.}-{3:3}:
[  594.159117]        __mutex_lock+0x95/0xd10
[  594.159122]        acpi_enable_wakeup_device_power+0x30/0x120
[  594.159130]        __acpi_device_wakeup_enable+0x34/0x110
[  594.159138]        acpi_pm_set_device_wakeup+0x55/0x140
[  594.159143]        __pci_enable_wake+0x56/0xb0
[  594.159150]        pci_finish_runtime_suspend+0x35/0x80
[  594.159157]        pci_pm_runtime_suspend+0xb5/0x1a0
[  594.159162]        __rpm_callback+0x3c/0x110
[  594.159170]        rpm_callback+0x58/0x70
[  594.159176]        rpm_suspend+0x15c/0x6f0
[  594.159182]        pm_runtime_work+0x9b/0xb0
[  594.159188]        process_one_work+0x263/0x520
[  594.159195]        worker_thread+0x4d/0x3b0
[  594.159200]        kthread+0xeb/0x120
[  594.159206]        ret_from_fork+0x29/0x50
[  594.159211]
               -> #2 (acpi_wakeup_lock){+.+.}-{3:3}:
[  594.159218]        __mutex_lock+0x95/0xd10
[  594.159223]        acpi_pm_set_device_wakeup+0x7a/0x140
[  594.159228]        __pci_enable_wake+0x77/0xb0
[  594.159234]        pci_pm_runtime_resume+0x70/0xd0
[  594.159240]        __rpm_callback+0x3c/0x110
[  594.159246]        rpm_callback+0x58/0x70
[  594.159252]        rpm_resume+0x50d/0x7a0
[  594.159258]        rpm_resume+0x267/0x7a0
[  594.159264]        __pm_runtime_resume+0x45/0x90
[  594.159270]        xe_pm_runtime_resume_and_get+0x12/0x50 [xe]
[  594.159314]        xe_device_mem_access_get+0x97/0xc0 [xe]
[  594.159346]        hw_engines+0x65/0xf0 [xe]
[  594.159380]        seq_read_iter+0x10d/0x4b0
[  594.159385]        seq_read+0x9e/0xd0
[  594.159390]        full_proxy_read+0x4e/0x80
[  594.159396]        vfs_read+0xb6/0x310
[  594.159401]        ksys_read+0x60/0xe0
[  594.159406]        do_syscall_64+0x38/0x90
[  594.159413]        entry_SYSCALL_64_after_hwframe+0x72/0xdc
[  594.159419]
               -> #1 (&xe->mem_access.lock){+.+.}-{3:3}:
[  594.159427]        xe_device_mem_access_get+0x43/0xc0 [xe]
[  594.159457]        xe_gt_tlb_invalidation_vma+0x53/0x190 [xe]
[  594.159490]        invalidation_fence_init+0x1d2/0x2c0 [xe]
[  594.159529]        __xe_pt_unbind_vma+0x151/0x4e0 [xe]
[  594.159564]        vm_bind_ioctl+0x48a/0xae0 [xe]
[  594.159602]        async_op_work_func+0x20c/0x530 [xe]
[  594.159634]        process_one_work+0x263/0x520
[  594.159640]        worker_thread+0x4d/0x3b0
[  594.159646]        kthread+0xeb/0x120
[  594.159650]        ret_from_fork+0x29/0x50
[  594.159655]
               -> #0 (&vm->userptr.notifier_lock){++++}-{3:3}:
[  594.159663]        __lock_acquire+0x16fa/0x2850
[  594.159670]        lock_acquire+0xd2/0x2e0
[  594.159676]        down_write+0x36/0xd0
[  594.159681]        vma_userptr_invalidate+0xeb/0x330 [xe]
[  594.159714]        __mmu_notifier_invalidate_range_start+0x239/0x2a0
[  594.159722]        unmap_vmas+0x1ac/0x1c0
[  594.159727]        unmap_region+0xb5/0x120
[  594.159732]        do_vmi_align_munmap+0x2be/0x430
[  594.159739]        do_vmi_munmap+0xea/0x120
[  594.159744]        __vm_munmap+0x9c/0x160
[  594.159750]        __x64_sys_munmap+0x12/0x20
[  594.159756]        do_syscall_64+0x38/0x90
[  594.159761]        entry_SYSCALL_64_after_hwframe+0x72/0xdc
[  594.159768]
               other info that might help us debug this:

[  594.159773] Chain exists of:
                 &vm->userptr.notifier_lock --> fs_reclaim -->
mmu_notifier_invalidate_range_start

[  594.159785]  Possible unsafe locking scenario:

[  594.159790]        CPU0                    CPU1
[  594.159794]        ----                    ----
[  594.159797]   lock(mmu_notifier_invalidate_range_start);
[  594.159802]                                lock(fs_reclaim);
[  594.159808]
lock(mmu_notifier_invalidate_range_start);
[  594.159814]   lock(&vm->userptr.notifier_lock);
[  594.159819]

The VM should be holding a mem_access.ref so this looks like it should
be a false positive and we can just drop the explicit mem_access in
xe_gt_tlb_invalidation().  The GGTT invalidation path also takes care to
hold mem_access.ref so should be fine there also, and we already assert
that we hold access.ref for the GuC communication underneath.

Signed-off-by: Matthew Auld <matthew.auld@intel.com>
Cc: Rodrigo Vivi <rodrigo.vivi@intel.com>
Cc: Thomas Hellström <thomas.hellstrom@linux.intel.com>
Reviewed-by: Matthew Brost <matthew.brost@intel.com>
Signed-off-by: Rodrigo Vivi <rodrigo.vivi@intel.com>
2023-12-21 11:37:36 -05:00
Matthew Auld 35c8a96439 drm/xe: handle TLB invalidations from CT fast-path
In various test cases that put the system under a heavy load, we can
sometimes see errors with missed TLB invalidations. In such cases we see
the interrupt arrive for the invalidation from the GuC, however the
actual processing of the completion is pushed onto a workqueue and
handled with all the other CT stuff, which might take longer than
expected. Since we expect TLB invalidations to complete within a
reasonable amount of time (at most ~250ms), and they do seem pretty
critical, allow handling directly from the CT fast-path.

v2 (José):
  - Actually use the correct spinlock/unlock_irq, since pending_lock is
    grabbed from IRQ.
v3:
  - Don't publish the TLB fence on the list until after we fully
    initialize it and successfully do the CT send. The list is now only
    protected by the spin_lock pending_lock and we can't hold that
    across the entire TLB send operation.
v4 (Matt Brost):
  - Be careful with racing against fast CT path writing the seqno,
    before we have actually published the fence.

References: https://gitlab.freedesktop.org/drm/xe/kernel/-/issues/297
References: https://gitlab.freedesktop.org/drm/xe/kernel/-/issues/320
References: https://gitlab.freedesktop.org/drm/xe/kernel/-/issues/449
Signed-off-by: Matthew Auld <matthew.auld@intel.com>
Cc: Matthew Brost <matthew.brost@intel.com>
Cc: José Roberto de Souza <jose.souza@intel.com>
Reviewed-by: Matthew Brost <matthew.brost@intel.com>
Signed-off-by: Rodrigo Vivi <rodrigo.vivi@intel.com>
2023-12-21 11:35:23 -05:00
Matthew Auld 4aa5e3594f drm/xe/tlb: print seqno_recv on fence TLB timeout
To help debugging, sample the current seqno_recv and dump it out if we
encounter a TLB timeout for the fences path.

Signed-off-by: Matthew Auld <matthew.auld@intel.com>
Cc: Matthew Brost <matthew.brost@intel.com>
Cc: José Roberto de Souza <jose.souza@intel.com>
Reviewed-by: Matthew Brost <matthew.brost@intel.com>
Signed-off-by: Rodrigo Vivi <rodrigo.vivi@intel.com>
2023-12-21 11:35:23 -05:00
Matthew Auld 2ca01fe31b drm/xe/tlb: also update seqno_recv during reset
We might have various kworkers waiting for TLB flushes to complete which
are not tracked with an explicit TLB fence, however at this stage that
will never happen since the CT is already disabled, so make sure we
signal them here under the assumption that we have completed a full GT
reset.

v2:
  - We need to use seqno - 1 here. After acquiring ct->lock the seqno is
    actually the next users seqno and not the pending one.

Signed-off-by: Matthew Auld <matthew.auld@intel.com>
Cc: Matthew Brost <matthew.brost@intel.com>
Cc: José Roberto de Souza <jose.souza@intel.com>
Reviewed-by: Matthew Brost <matthew.brost@intel.com>
Signed-off-by: Rodrigo Vivi <rodrigo.vivi@intel.com>
2023-12-21 11:35:23 -05:00
Matthew Auld 4803f6e26f drm/xe/tlb: increment next seqno after successful CT send
If we are in the middle of a GT reset or similar the CT might be
disabled, such that the CT send fails. However we already incremented
gt->tlb_invalidation.seqno which might lead to warnings, since we
effectively just skipped a seqno:

    0000:00:02.0: drm_WARN_ON(expected_seqno != msg[0])

Signed-off-by: Matthew Auld <matthew.auld@intel.com>
Cc: Matthew Brost <matthew.brost@intel.com>
Cc: José Roberto de Souza <jose.souza@intel.com>
Reviewed-by: Matthew Brost <matthew.brost@intel.com>
Signed-off-by: Rodrigo Vivi <rodrigo.vivi@intel.com>
2023-12-21 11:35:22 -05:00
Matthew Auld 86ed09250e drm/xe/tlb: ensure we access seqno_recv once
Ensure we load gt->tlb_invalidation.seqno_recv once, and use that for
our seqno checking. The gt->tlb_invalidation_seqno_past is a shared
global variable and can potentially change at any point here.  However
the checks here need to operate on a stable version of seqno_recv for
this to make any sense.

Signed-off-by: Matthew Auld <matthew.auld@intel.com>
Cc: Matthew Brost <matthew.brost@intel.com>
Cc: José Roberto de Souza <jose.souza@intel.com>
Reviewed-by: Matthew Brost <matthew.brost@intel.com>
Signed-off-by: Rodrigo Vivi <rodrigo.vivi@intel.com>
2023-12-21 11:35:22 -05:00
Matthew Auld 38fa29dc2b drm/xe/tlb: drop unnecessary smp_wmb()
wake_up_all() and wait_event_timeout() already have the correct barriers
as per https://www.kernel.org/doc/Documentation/memory-barriers.txt.
This should ensure that the seqno_recv write can't be re-ordered wrt to
the actual wake_up_all() i.e we get woken up but there is no write.  The
reader side with wait_event_timeout() also has the correct barriers.
With that drop the hand rolled smp_wmb(), which is anyway missing some
kind of matching barrier on the reader side.

Signed-off-by: Matthew Auld <matthew.auld@intel.com>
Cc: Matthew Brost <matthew.brost@intel.com>
Cc: José Roberto de Souza <jose.souza@intel.com>
Reviewed-by: Matthew Brost <matthew.brost@intel.com>
Signed-off-by: Rodrigo Vivi <rodrigo.vivi@intel.com>
2023-12-21 11:35:22 -05:00
Matthew Brost 21ed3327e3 drm/xe: Add helpers to hide struct xe_vma internals
This will help with the GPUVA port as the internals of struct xe_vma
will change.

v2: Update comment around helpers

Reviewed-by: Thomas Hellström <thomas.hellstrom@linux.kernel.org>
Signed-off-by: Matthew Brost <matthew.brost@intel.com>
Signed-off-by: Rodrigo Vivi <rodrigo.vivi@intel.com>
2023-12-21 11:35:18 -05:00
Matt Roper f79ee3013a drm/xe: Add backpointer from gt to tile
Rather than a backpointer to the xe_device, a GT should have a
backpointer to its tile (which can then be used to lookup the device if
necessary).

The gt_to_xe() helper macro (which moves from xe_gt.h to xe_gt_types.h)
can and should still be used to jump directly from an xe_gt to
xe_device.

v2:
 - Fix kunit test build
 - Move a couple changes to the previous patch. (Lucas)

Reviewed-by: Matt Atwood <matthew.s.atwood@intel.com>
Reviewed-by: Lucas De Marchi <lucas.demarchi@intel.com>
Link: https://lore.kernel.org/r/20230601215244.678611-4-matthew.d.roper@intel.com
Signed-off-by: Matt Roper <matthew.d.roper@intel.com>
Signed-off-by: Rodrigo Vivi <rodrigo.vivi@intel.com>
2023-12-19 18:34:11 -05:00
Matthew Auld 3af4365003 drm/xe: keep pulling mem_access_get further back
Lockdep is unhappy about ggtt->lock -> runtime_pm, where it seems
to think this can somehow get inverted. The ggtt->lock looks like a
potentially sensitive driver lock, so likely a sensible move to never
call the runtime_pm routines while holding it. Actually it looks like
d3cold wants to grab this, so perhaps this can indeed deadlock.

v2:
 - Don't forget about xe_gt_tlb_invalidation_vma(), which now needs
   explicit access_get.

Signed-off-by: Matthew Auld <matthew.auld@intel.com>
Cc: Thomas Hellström <thomas.hellstrom@linux.intel.com>
Cc: Rodrigo Vivi <rodrigo.vivi@intel.com>
Reviewed-by: Rodrigo Vivi <rodrigo.vivi@intel.com>
Signed-off-by: Rodrigo Vivi <rodrigo.vivi@intel.com>
2023-12-19 18:34:10 -05:00
Matthew Auld 565ce72e1c drm/xe: don't allocate under ct->lock
Seems to be a sensitive lock, where ct->lock looks to be primed with
fs_reclaim, so holding that and then allocating memory will cause
lockdep to complain. We need to change the ordering wrt to grabbing the
ct->lock and potentially grabbing the runtime_pm, since some of the
runtime_pm routines can allocate memory (or at least that's what lockdep
seems to suggest).

Signed-off-by: Matthew Auld <matthew.auld@intel.com>
Cc: Thomas Hellström <thomas.hellstrom@linux.intel.com>
Cc: Rodrigo Vivi <rodrigo.vivi@intel.com>
Reviewed-by: Rodrigo Vivi <rodrigo.vivi@intel.com>
Signed-off-by: Rodrigo Vivi <rodrigo.vivi@intel.com>
2023-12-19 18:34:09 -05:00
Matthew Auld a2db319211 drm/xe: fix tlb_invalidation_seqno_past()
Checking seqno_recv >= seqno looks like it will incorrectly report true
when the seqno has wrapped (not unlikely given
TLB_INVALIDATION_SEQNO_MAX). Calling xe_gt_tlb_invalidation_wait() might
then return before the flush has been completed by the GuC.

Fix this by treating a large negative delta as an indication that the
seqno has wrapped around. Similar to how we treat a large positive delta
as an indication that the seqno_recv must have wrapped around, but in
that case the seqno has likely also signalled.

It looks like we could also potentially make the seqno use the full
32bits as supported by the GuC.

Signed-off-by: Matthew Auld <matthew.auld@intel.com>
Cc: Thomas Hellström <thomas.hellstrom@linux.intel.com>
Cc: Matthew Brost <matthew.brost@intel.com>
Reviewed-by: Matthew Brost <matthew.brost@intel.com>
Signed-off-by: Rodrigo Vivi <rodrigo.vivi@intel.com>
2023-12-19 18:33:50 -05:00
Nirmoy Das a5cecbac92 drm/xe: Print GT info on TLB inv failure
Print GT info on TLB inv failure for better debugbility.

Signed-off-by: Nirmoy Das <nirmoy.das@intel.com>
Reviewed-by: Rodrigo Vivi <rodrigo.vivi@intel.com>
Signed-off-by: Rodrigo Vivi <rodrigo.vivi@intel.com>
2023-12-19 18:33:49 -05:00
Matthew Auld fa4fe0db08 drm/xe/tlb: fix expected_seqno calculation
It looks like when tlb_invalidation.seqno overflows
TLB_INVALIDATION_SEQNO_MAX, we start counting again from one, as per
send_tlb_invalidation(). This is also inline with initial value we give
it in xe_gt_tlb_invalidation_init().  When calculating the
expected_seqno we should also take this into account.

While we are here also print out the values if we ever trigger the
warning.

v2 (José):
  - drm_WARN_ON() is preferred over plain WARN_ON(), since it gives
    information on the originating device.

Closes: https://gitlab.freedesktop.org/drm/xe/kernel/-/issues/248
Signed-off-by: Matthew Auld <matthew.auld@intel.com>
Cc: José Roberto de Souza <jose.souza@intel.com>
Reviewed-by: José Roberto de Souza <jose.souza@intel.com>
Signed-off-by: Rodrigo Vivi <rodrigo.vivi@intel.com>
2023-12-19 18:31:41 -05:00