Commit Graph

486 Commits

Author SHA1 Message Date
Candice Li 46e2231ce0 drm/amdgpu: Log deferred error separately
Separate deferred error from UE and CE and log it
individually.

Signed-off-by: Candice Li <candice.li@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2024-01-15 18:35:37 -05:00
Yang Wang 37973b69ea drm/amdgpu: add aca sysfs support
add aca sysfs node support

Signed-off-by: Yang Wang <kevinyang.wang@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2024-01-15 18:35:36 -05:00
Yang Wang 04c4fcd263 drm/amdgpu: add amdgpu ras aca query interface
v1:
add ACA error query interface

v2:
Add a new helper function to determine whether to use ACA or MCA.

Signed-off-by: Yang Wang <kevinyang.wang@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2024-01-15 18:35:36 -05:00
Yang Wang 33dcda51e9 drm/amdgpu: add ACA bank dump debugfs support
add ACA bank dump debugfs support

Signed-off-by: Yang Wang <kevinyang.wang@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2024-01-15 18:35:35 -05:00
Hawking Zhang cce4febb27 drm/amdgpu: Add ras helper to query boot errors v2
Add ras helper function to query boot time gpu
errors.
v2: use aqua_vanjaram smn addressing pattern

Signed-off-by: Hawking Zhang <Hawking.Zhang@amd.com>
Reviewed-by: Yang Wang <kevinyang.wang@amd.com>
Reviewed-by: Le Ma <le.ma@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2024-01-15 18:35:35 -05:00
Hawking Zhang 73cb81dc54 drm/amdgpu: Packed socket_id to ras feature mask
Initialize RAS feature mask bit[31:29] with socket_id.

Signed-off-by: Hawking Zhang <Hawking.Zhang@amd.com>
Reviewed-by: Tao Zhou <tao.zhou1@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2024-01-09 15:44:13 -05:00
Candice Li fb1e917199 drm/amdgpu: Support poison error injection via ras_ctrl debugfs
Support poison error injection.

Signed-off-by: Candice Li <candice.li@amd.com>
Reviewed-by: Tao Zhou <tao.zhou1@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2024-01-09 15:44:13 -05:00
Candice Li 90bd01471d drm/amdgpu: Drop unnecessary sentences about CE and deferred error.
Remove "no user action is needed" for correctable and deferred error
to avoid confusion.

Signed-off-by: Candice Li <candice.li@amd.com>
Reviewed-by: Tao Zhou <tao.zhou1@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2024-01-09 15:43:54 -05:00
Hawking Zhang 6697dbf0af Revert "drm/amdgpu: enable mca debug mode on APU by default"
Not needed any more with firmware fixes

Signed-off-by: Hawking Zhang <Hawking.Zhang@amd.com>
Reviewed-by: Tao Zhou <tao.zhou1@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2024-01-05 16:04:36 -05:00
Srinivasan Shanmugam b8d55a90fd drm/amdgpu: Fix possible NULL dereference in amdgpu_ras_query_error_status_helper()
Return invalid error code -EINVAL for invalid block id.

Fixes the below:

drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c:1183 amdgpu_ras_query_error_status_helper() error: we previously assumed 'info' could be null (see line 1176)

Suggested-by: Hawking Zhang <Hawking.Zhang@amd.com>
Cc: Tao Zhou <tao.zhou1@amd.com>
Cc: Hawking Zhang <Hawking.Zhang@amd.com>
Cc: Christian König <christian.koenig@amd.com>
Cc: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Srinivasan Shanmugam <srinivasan.shanmugam@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2024-01-03 11:16:06 -05:00
Srinivasan Shanmugam 091411be7a drm/amdgpu: Use kzalloc instead of kmalloc+__GFP_ZERO in amdgpu_ras.c
Fixes the below smatch warnings:

drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c:2543 amdgpu_ras_recovery_init() warn: Please consider using kzalloc instead of kmalloc
drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c:2830 amdgpu_ras_init() warn: Please consider using kzalloc instead of kmalloc

Cc: Christian König <christian.koenig@amd.com>
Cc: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Srinivasan Shanmugam <srinivasan.shanmugam@amd.com>
Reviewed-by: Tao Zhou <tao.zhou1@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2024-01-03 11:16:06 -05:00
YiPeng Chai 9f91e983ee drm/amdgpu: MCA supports recording umc address information
MCA supports recording umc address information.

V2:
  Move err_addr variable from struct ras_err_node to
struct ras_err_info.

Signed-off-by: YiPeng Chai <YiPeng.Chai@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2023-12-19 14:59:03 -05:00
Dave Airlie c1ee197d64 Linux 6.7-rc5
-----BEGIN PGP SIGNATURE-----
 
 iQFSBAABCAA8FiEEq68RxlopcLEwq+PEeb4+QwBBGIYFAmV2PMQeHHRvcnZhbGRz
 QGxpbnV4LWZvdW5kYXRpb24ub3JnAAoJEHm+PkMAQRiGhgQIAKSdjtpkv+IeDwph
 umLHxDlCTT8nl+TAhEYuxSdMiHt1i7+zyF0snnHhGthBRnBZBf+PDdkQOmqtjLZj
 e591VGY7tdNCAVUwmRRaI82JFz9Pl1k8DXL3f+CE1+MfbpsirAecoIAL0rsvg+4f
 ICKSAOBZonL7GT7IGrsbxgqGKCLC+2PcgNT4pADBdjtbC1nmVSLK21dtWr6i4xsn
 CT9MxAnej1+EOpreyHhNZUVqfQy0/04x4OY0u/EOXU3PgpfGTrAzkyfwzcd/eit0
 x59TZ0owDVXfvvdXUUAC0zG1th2ek5LKNtL+C4rKOoAbq7GZoQfjN5M/Ug+yykwE
 aqm6CJ8=
 =tZ+R
 -----END PGP SIGNATURE-----

Backmerge tag 'v6.7-rc5' into drm-next

Linux 6.7-rc5

Alex requested this for some amdkfd work relying on the symbols exports.

Signed-off-by: Dave Airlie <airlied@redhat.com>
2023-12-12 11:32:33 +10:00
Yang Wang dbf3850d12 drm/amdgpu: optimize the printing order of error data
sort error data list to optimize the printing order.

Signed-off-by: Yang Wang <kevinyang.wang@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2023-12-06 16:05:32 -05:00
Yang Wang f875f61b1f drm/amdgpu: enable mca debug mode on APU by default
enable MCA debug mode on APU device by default.

Signed-off-by: Yang Wang <kevinyang.wang@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2023-11-30 18:26:31 -05:00
Lijo Lazar 201761b5eb drm/amdgpu: Move mca debug mode decision to ras
Refactor code such that ras block decides the default mca debug mode,
and not swsmu block.

By default mca debug mode is set to false.

v2: squash in uninitialized value fix (Alex)

Signed-off-by: Lijo Lazar <lijo.lazar@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2023-11-29 16:49:01 -05:00
Yang Wang 07ee43faeb drm/amdgpu: fix ras err_data null pointer issue in amdgpu_ras.c
fix ras err_data null pointer issue in amdgpu_ras.c

Fixes: 8cc0f5669e ("drm/amdgpu: Support multiple error query modes")
Signed-off-by: Yang Wang <kevinyang.wang@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2023-11-17 00:51:58 -05:00
Vitaly Prosyak 4638e0c29a drm/amdgpu: fix software pci_unplug on some chips
When software 'pci unplug' using IGT is executed we got a sysfs directory
entry is NULL for differant ras blocks like hdp, umc, etc.
Before call 'sysfs_remove_file_from_group' and 'sysfs_remove_group'
check that 'sd' is  not NULL.

[  +0.000001] RIP: 0010:sysfs_remove_group+0x83/0x90
[  +0.000002] Code: 31 c0 31 d2 31 f6 31 ff e9 9a a8 b4 00 4c 89 e7 e8 f2 a2 ff ff eb c2 49 8b 55 00 48 8b 33 48 c7 c7 80 65 94 82 e8 cd 82 bb ff <0f> 0b eb cc 66 0f 1f 84 00 00 00 00 00 90 90 90 90 90 90 90 90 90
[  +0.000001] RSP: 0018:ffffc90002067c90 EFLAGS: 00010246
[  +0.000002] RAX: 0000000000000000 RBX: ffffffff824ea180 RCX: 0000000000000000
[  +0.000001] RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000000
[  +0.000001] RBP: ffffc90002067ca8 R08: 0000000000000000 R09: 0000000000000000
[  +0.000001] R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000000
[  +0.000001] R13: ffff88810a395f48 R14: ffff888101aab0d0 R15: 0000000000000000
[  +0.000001] FS:  00007f5ddaa43a00(0000) GS:ffff88841e800000(0000) knlGS:0000000000000000
[  +0.000002] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  +0.000001] CR2: 00007f8ffa61ba50 CR3: 0000000106432000 CR4: 0000000000350ef0
[  +0.000001] Call Trace:
[  +0.000001]  <TASK>
[  +0.000001]  ? show_regs+0x72/0x90
[  +0.000002]  ? sysfs_remove_group+0x83/0x90
[  +0.000002]  ? __warn+0x8d/0x160
[  +0.000001]  ? sysfs_remove_group+0x83/0x90
[  +0.000001]  ? report_bug+0x1bb/0x1d0
[  +0.000003]  ? handle_bug+0x46/0x90
[  +0.000001]  ? exc_invalid_op+0x19/0x80
[  +0.000002]  ? asm_exc_invalid_op+0x1b/0x20
[  +0.000003]  ? sysfs_remove_group+0x83/0x90
[  +0.000001]  dpm_sysfs_remove+0x61/0x70
[  +0.000002]  device_del+0xa3/0x3d0
[  +0.000002]  ? ktime_get_mono_fast_ns+0x46/0xb0
[  +0.000002]  device_unregister+0x18/0x70
[  +0.000001]  i2c_del_adapter+0x26d/0x330
[  +0.000002]  arcturus_i2c_control_fini+0x25/0x50 [amdgpu]
[  +0.000236]  smu_sw_fini+0x38/0x260 [amdgpu]
[  +0.000241]  amdgpu_device_fini_sw+0x116/0x670 [amdgpu]
[  +0.000186]  ? mutex_lock+0x13/0x50
[  +0.000003]  amdgpu_driver_release_kms+0x16/0x40 [amdgpu]
[  +0.000192]  drm_minor_release+0x4f/0x80 [drm]
[  +0.000025]  drm_release+0xfe/0x150 [drm]
[  +0.000027]  __fput+0x9f/0x290
[  +0.000002]  ____fput+0xe/0x20
[  +0.000002]  task_work_run+0x61/0xa0
[  +0.000002]  exit_to_user_mode_prepare+0x150/0x170
[  +0.000002]  syscall_exit_to_user_mode+0x2a/0x50

Cc: Hawking Zhang <hawking.zhang@amd.com>
Cc: Luben Tuikov <luben.tuikov@amd.com>
Cc: Alex Deucher <alexander.deucher@amd.com>
Cc: Christian Koenig <christian.koenig@amd.com>
Signed-off-by: Vitaly Prosyak <vitaly.prosyak@amd.com>
Reviewed-by: Luben Tuikov <luben.tuikov@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2023-11-09 17:02:49 -05:00
Hawking Zhang 8cc0f5669e drm/amdgpu: Support multiple error query modes
Direct error query mode and firmware error query mode
are supported for now.

Signed-off-by: Hawking Zhang <Hawking.Zhang@amd.com>
Reviewed-by: Yang Wang <kevinyang.wang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2023-11-09 17:01:58 -05:00
Tao Zhou 18eae367cb drm/amdgpu: check recovery status of xgmi hive in ras_reset_error_count
Handle xgmi hive case.

Suggested-by: Hawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: Tao Zhou <tao.zhou1@amd.com>
Reviewed-by: Stanley.Yang <Stanley.Yang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2023-11-03 12:18:32 -04:00
Tao Zhou d1d4c0b7b6 drm/amdgpu: check RAS supported first in ras_reset_error_count
Not all platforms support RAS.

Fixes: 73582be11a ("drm/amdgpu: bypass RAS error reset in some conditions")
Signed-off-by: Tao Zhou <tao.zhou1@amd.com>
Reviewed-by: Yang Wang <kevinyang.wang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2023-10-31 16:40:15 -04:00
Tao Zhou 73582be11a drm/amdgpu: bypass RAS error reset in some conditions
PMFW is responsible for RAS error reset in some conditions, driver can
skip the operation.

v2: add check for ras->in_recovery, it's set earlier than
amdgpu_in_reset.

v3: fix error in gpu reset check.

Signed-off-by: Tao Zhou <tao.zhou1@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2023-10-26 18:41:22 -04:00
Tao Zhou f3a3bbf156 drm/amdgpu: enable RAS poison mode for APU
Enable it by default on APU platform.

Signed-off-by: Tao Zhou <tao.zhou1@amd.com>
Reviewed-by: Stanley.Yang <Stanley.Yang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2023-10-26 18:41:22 -04:00
Yang Wang ec3e0a9167 drm/amdgpu: refine ras error kernel log print
refine ras error kernel log to avoid user-ridden ambiguity.

Signed-off-by: Yang Wang <kevinyang.wang@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2023-10-26 18:41:21 -04:00
Yang Wang 53d4d77927 drm/amdgpu: fix find ras error node error
the origin function might return the wrong node.

Fixes: 5b1270beb3 ("drm/amdgpu: add ras_err_info to identify RAS error source")
Signed-off-by: Yang Wang <kevinyang.wang@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2023-10-26 18:41:21 -04:00
Tao Zhou 8096df7664 drm/amdgpu: add set/get mca debug mode operations
Record the debug mode status in RAS.

Signed-off-by: Tao Zhou <tao.zhou1@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2023-10-20 15:11:28 -04:00
Stanley.Yang 66d64e4e03 drm/amdgpu: Enable RAS feature by default for APU
Enable RAS feature by default for aqua vanjaram on apu
platform.

Signed-off-by: Stanley.Yang <Stanley.Yang@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2023-10-20 15:11:28 -04:00
Yang Wang 49c260bef3 drm/amdgpu: fix typo for amdgpu ras error data print
typo fix.

Fixes: 5b1270beb3 ("drm/amdgpu: add ras_err_info to identify RAS error source")
Signed-off-by: Yang Wang <kevinyang.wang@amd.com>
Reviewed-by: Candice Li <candice.li@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2023-10-20 15:11:28 -04:00
Stanley.Yang 8a65661114 drm/amdgpu: Fix delete nodes that have been relesed
Fix delete nodes that it has been freed.

Fixes: 5b1270beb3 ("drm/amdgpu: add ras_err_info to identify RAS error source")
Signed-off-by: Stanley.Yang <Stanley.Yang@amd.com>
Reviewed-by: Yang Wang <kevinyang.wang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2023-10-20 15:11:27 -04:00
Hawking Zhang 9248462d7e drm/amdgpu: Enable software RAS in vcn v4_0_3
Set VCN/JPEG RAS masks to enable software RAS for
VCN and JPEG.

Signed-off-by: Hawking Zhang <Hawking.Zhang@amd.com>
Reviewed-by: Tao Zhou <tao.zhou1@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2023-10-20 15:11:26 -04:00
Tao Zhou 472c5fb297 drm/amdgpu: define ras_reset_error_count function
Make the code architecture more simple.

v2: reuse ras_reset_error_count in ras_reset_error_status.

Signed-off-by: Tao Zhou <tao.zhou1@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2023-10-20 15:11:26 -04:00
Asad Kamal 53dd920c1f drm/amdgpu : Add hive ras recovery check
If one of the devices in the hive detects a
fatal error, need to send ras recovery reset
message to PMFW of all devices in the hive.
For that add a flag in hive to indicate that
it's undergoing ras recovery

Signed-off-by: Asad Kamal <asad.kamal@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Reviewed-by: Lijo Lazar <lijo.lazar@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2023-10-19 18:26:51 -04:00
Yang Wang 5b1270beb3 drm/amdgpu: add ras_err_info to identify RAS error source
introduced "ras_err_info" to better identify a RAS ERROR source.

NOTE:
For legacy chips, keep the original RAS error print format.

v1:
RAS errors may come from different dies during a RAS error query,
therefore, need a new data structure to identify the source of RAS ERROR.

v2:
- use new data structure 'amdgpu_smuio_mcm_config_info' instead of
  ras_err_id (in v1 patch)
- refine ras error dump function name
- refine ras error dump log format

Signed-off-by: Yang Wang <kevinyang.wang@amd.com>
Reviewed-by: Tao Zhou <tao.zhou1@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2023-10-13 11:35:35 -04:00
Asad Kamal 625e5f3851 drm/amdgpu: Expose ras version & schema info
Expose ras table version & schema info to sysfs

v2: Updated schema to get poison support info
from ras context, removed asic specific checks

Signed-off-by: Asad Kamal <asad.kamal@amd.com>
Reviewed-by: Lijo Lazar <lijo.lazar@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2023-10-13 11:02:54 -04:00
Stanley.Yang a83f2bf1f4 drm/amdgpu: Fix false positive error log
It should first check block ras obj whether be set, it should
return 0 directly if block ras obj or hw_ops is not set.
If block doesn't support RAS just return 0 is fine.

Changed from V1:
	return 0 directly if block ras obj or hw ops is not set

Signed-off-by: Stanley.Yang <Stanley.Yang@amd.com>
Reviewed-by: Tao Zhou <tao.zhou1@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2023-09-20 16:24:09 -04:00
Cong Liu 5838f74c29 drm/amdgpu: fix a memory leak in amdgpu_ras_feature_enable
This patch fixes a memory leak in the amdgpu_ras_feature_enable() function.
The leak occurs when the function sends a command to the firmware to enable
or disable a RAS feature for a GFX block. If the command fails, the kfree()
function is not called to free the info memory.

Fixes: 9f051d6ff1 ("drm/amdgpu: Free ras cmd input buffer properly")
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: Cong Liu <liucong2@kylinos.cn>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2023-09-20 16:24:07 -04:00
Yang Wang 4051844c66 drm/amdgpu: add amdgpu mca debug sysfs support
add amdgpu mca debug sysfs support.

Signed-off-by: Yang Wang <kevinyang.wang@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2023-09-20 12:25:19 -04:00
Lijo Lazar 4e8303cf2c drm/amdgpu: Use function for IP version check
Use an inline function for version check. Gives more flexibility to
handle any format changes.

Signed-off-by: Lijo Lazar <lijo.lazar@amd.com>
Reviewed-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2023-09-20 12:23:28 -04:00
Hawking Zhang 9f9d4651f7 drm/amdgpu: fallback to old RAS error message for aqua_vanjaram
So driver doesn't generate incorrect message until
the new format is settled down for aqua_vanjaram

Signed-off-by: Hawking Zhang <Hawking.Zhang@amd.com>
Reviewed-by: Yang Wang <kevinyang.wang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2023-09-11 17:15:02 -04:00
Hawking Zhang bf7aa8bea9 drm/amdgpu: Free ras cmd input buffer properly
Do not access the pointer for ras input cmd buffer
if it is even not allocated.

Signed-off-by: Hawking Zhang <Hawking.Zhang@amd.com>
Reviewed-by: Stanley Yang <Stanley.Yang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2023-08-30 15:51:16 -04:00
Hawking Zhang ec70578c83 drm/amdgpu: Allow issue disable gfx ras cmd to firmware
Disable gfx ras command is needed in some use cases

Signed-off-by: Hawking Zhang <Hawking.Zhang@amd.com>
Reviewed-by: Tao Zhou <tao.zhou1@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2023-08-30 15:25:54 -04:00
YiPeng Chai 80578f1641 drm/amdgpu: Enable ras for mp0 v13_0_6 sriov
Enable ras for mp0 v13_0_6 sriov

Signed-off-by: YiPeng Chai <YiPeng.Chai@amd.com>
Reviewed-by: Stanley.Yang <Stanley.Yang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2023-08-30 14:58:16 -04:00
YiPeng Chai 1b98a5f8e0 drm/amdgpu: mode1 reset needs to recover mp1 for mp0 v13_0_10
Mode1 reset needs to recover mp1 in fatal error case
for mp0 v13_0_10.

v2:
  Define a macro to wrap psp function calls.

Signed-off-by: YiPeng Chai <YiPeng.Chai@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2023-08-15 18:07:41 -04:00
Hawking Zhang 8b3a7a707c drm/amdgpu: Remove unnecessary ras cap check
RAS global isr will only be invoked by hardware
interrupt. Don't need to query ras capability in isr
In addition, amdgpu_ras_interrupt_fatal_error_handler
ensures the isr won't be called from guest linux
side by accident. The RAS cap check in isr that
introduced to fix sriov crash is not needed any more

Signed-off-by: Hawking Zhang <Hawking.Zhang@amd.com>
Reviewed-by: Tao Zhou <tao.zhou1@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2023-08-15 18:07:41 -04:00
Candice Li bc0f80802d drm/amdgpu: Extend poison mode check to SDMA/VCN/JPEG
Treat SDMA/VCN/JPEG as RAS capable IP blocks in poison mode.

Signed-off-by: Candice Li <candice.li@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2023-08-09 09:46:05 -04:00
Tao Zhou 7692e1ee24 drm/amdgpu: add RAS fatal error handler for NBIO v7.9
Register RAS fatal error interrupt and add handler.

v2: only register NBIO RAS for dGPU platform.
    change nbio_v7_9_set_ras_controller_irq_state and nbio_v7_9_set_ras_err_event_athub_irq_state
    to dummy functions.

Signed-off-by: Tao Zhou <tao.zhou1@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2023-08-09 09:46:04 -04:00
Hawking Zhang 6fc9d92c3d drm/amdgpu: Issue ras enable_feature for gfx ip only
For non-GFX IP blocks, set up ras obj if ras feature
is allowed. For GFX IP blocks, force issue ras
enable_feature command to firmware and only set up ras
obj if ras feature is allowed

Signed-off-by: Hawking Zhang <Hawking.Zhang@amd.com>
Reviewed-by: Tao Zhou <tao.zhou1@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2023-08-07 17:12:49 -04:00
Hawking Zhang 62c4b772bd drm/amdgpu: Apply poison mode check to GFX IP only
For GFX IP that only supports poison consumption, GFX
RAS won't be marked as enabled. i.e., hardware doesn't
support gfx sram ecc. But driver still needs to issue
firmware to enable poison consumption mode for GFX IP.
In such case, check poison mode and treat GFX IP as
RAS capable IP block.

Signed-off-by: Hawking Zhang <Hawking.Zhang@amd.com>
Reviewed-by: Tao Zhou <tao.zhou1@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2023-08-07 17:12:49 -04:00
Hawking Zhang f957138cc3 drm/amdgpu: Only create err_count sysfs when hw_op is supported
Some IP blocks only support partial ras feature and don't
have ras counter and/or ras error status register at all.
Driver should not create err_count sysfs node for those
IP blocks.

Signed-off-by: Hawking Zhang <Hawking.Zhang@amd.com>
Reviewed-by: Tao Zhou <tao.zhou1@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2023-08-07 17:12:49 -04:00
Stanley.Yang fcb7a1849a drm/amdgpu: Check APU flag to disable RAS
Only disable RAS by default for aqua vanjaram on APU platform.

Signed-off-by: Stanley.Yang <Stanley.Yang@amd.com>
Reviewed-by: Tao Zhou <tao.zhou1@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2023-07-25 13:47:26 -04:00
Mario Limonciello adf64e2142 drm/amd: Avoid reading the VBIOS part number twice
The VBIOS part number is read both in amdgpu_atom_parse() as well
as in atom_get_vbios_pn() and stored twice in the `struct atom_context`
structure. Remove the first unnecessary read and move the `pr_info`
line from that read into the second.

v2: squash in unused variable removal

Signed-off-by: Mario Limonciello <mario.limonciello@amd.com>
Reviewed-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2023-07-21 16:52:24 -04:00
Stanley.Yang 276f6e8cb7 drm/amdgpu: Disable RAS by default on APU flatform
Disable RAS feature by default for aqua vanjaram on APU platform.

Changed from V1:
	Splite Disable RAS by default on APU platform into a
	separated patch.

Changed from V2:
	Avoid to modify global variable amdgpu_ras_enable.

Signed-off-by: Stanley.Yang <Stanley.Yang@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Reviewed-by: Tao Zhou <tao.zhou1@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2023-07-18 11:11:36 -04:00
Stanley.Yang cb906ce32b drm/amdgpu: Enable aqua vanjaram RAS
Enable RAS for aqua vanjaram.

Changed from V1:
	Split the change in amdgpu_ras_asic_supported into a
	separated patch.

Changed from V2:
	Avoid to modify global variable amdgpu_ras_enable.

Signed-off-by: Stanley.Yang <Stanley.Yang@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Reviewed-by: Tao Zhou <tao.zhou1@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2023-07-18 11:11:23 -04:00
Tao Zhou a80fe1a698 drm/amdgpu: skip address adjustment for GFX RAS injection
The address parameter of GFX RAS injection isn't related to XGMI node
number, keep it unchanged.

Signed-off-by: Tao Zhou <tao.zhou1@amd.com>
Reviewed-by: Candice Li <candice.li@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2023-07-07 13:51:47 -04:00
YiPeng Chai 2c7cd280e5 drm/amdgpu: gpu recovers from fatal error in poison mode
Fatal error occurs in ras poison mode, mode1 reset
is used to recover gpu.

Signed-off-by: YiPeng Chai <YiPeng.Chai@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2023-06-30 13:12:15 -04:00
Stanley.Yang 43aedbf4da drm/amdgpu: Add checking mc_vram_size
Do not compare injection address with mc_vram_size
if mc_vram_size is zero.

Signed-off-by: Stanley.Yang <Stanley.Yang@amd.com>
Reviewed-by: Tao Zhou <tao.zhou1@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2023-06-15 11:06:59 -04:00
Stanley.Yang 38298ce6fc drm/amdgpu: Optimize checking ras supported
Using "is_app_apu" to identify device in the native
APU mode or carveout mode.

Signed-off-by: Stanley.Yang <Stanley.Yang@amd.com>
Reviewed-by: Tao Zhou <tao.zhou1@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2023-06-15 11:06:59 -04:00
Luben Tuikov 71344a718a drm/amdgpu: Fix usage of UMC fill record in RAS
The fixed commit listed in the Fixes tag below, introduced a bug in
amdgpu_ras.c::amdgpu_reserve_page_direct(), in that when introducing the new
amdgpu_umc_fill_error_record() and internally in that new function the physical
address (argument "uint64_t retired_page"--wrong name) is right-shifted by
AMDGPU_GPU_PAGE_SHIFT. Thus, in amdgpu_reserve_page_direct() when we pass
"address" to that new function, we should NOT right-shift it, since this
results, erroneously, in the page address to be 0 for first
2^(2*AMDGPU_GPU_PAGE_SHIFT) memory addresses.

This commit fixes this bug.

Cc: Tao Zhou <tao.zhou1@amd.com>
Cc: Hawking Zhang <Hawking.Zhang@amd.com>
Cc: Alex Deucher <Alexander.Deucher@amd.com>
Fixes: 400013b268 ("drm/amdgpu: add umc_fill_error_record to make code more simple")
Signed-off-by: Luben Tuikov <luben.tuikov@amd.com>
Link: https://lore.kernel.org/r/20230610113536.10621-1-luben.tuikov@amd.com
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2023-06-15 11:06:58 -04:00
Luben Tuikov 740f42a28f drm/amdgpu: Report ras_num_recs in debugfs
Report the number of records stored in the RAS EEPROM table in debugfs.

This can be used by user-space to calculate the capacity of the RAS EEPROM
table since "bad_page_cnt_threshold" is also reported in the same place in
debugfs.

See commit 7fb6407145 ("drm/amdgpu: Add bad_page_cnt_threshold to debugfs").

ras_num_recs can already be inferred by dumping the RAS EEPROM table, also in
the same debugfs location, see commit reference c65b0805e7 (drm/amdgpu:
RAS EEPROM table is now in debugfs, 2021-04-08). This commit makes it an
integer value easily shown in a single file.

Cc: Alex Deucher <Alexander.Deucher@amd.com>
Cc: Hawking Zhang <Hawking.Zhang@amd.com>
Cc: Tao Zhou <tao.zhou1@amd.com>
Cc: Stanley Yang <Stanley.Yang@amd.com>
Cc: John Clements <john.clements@amd.com>
Signed-off-by: Luben Tuikov <luben.tuikov@amd.com>
Link: https://lore.kernel.org/r/20230603051043.211548-1-luben.tuikov@amd.com
Acked-by: Alex Deucher <Alexander.Deucher@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2023-06-15 11:06:58 -04:00
Stanley.Yang 7f599fed3b drm/amdgpu: Add support EEPROM table v2.1
Add ras info to EEPROM table, app can analyse device ECC
status without GPU driver through EEPROM table ras info.

Signed-off-by: Stanley.Yang <Stanley.Yang@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2023-06-09 12:44:34 -04:00
Stanley.Yang e3959cb547 drm/amdgpu: support check vcn jpeg block mask
Support VCN/JPEG instance mask checking, pass logical
mask directly except GFX/SDMA/VCN/JPEG blocks.

Changed from V1:
	correct a typo

Signed-off-by: Stanley.Yang <Stanley.Yang@amd.com>
Reviewed-by: Tao Zhou <tao.zhou1@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2023-06-09 12:39:54 -04:00
YiPeng Chai 6c47a79b3b drm/amdgpu: perform mode2 reset for sdma fed error on gfx v11_0_3
perform mode2 reset for sdma fed error on gfx v11_0_3.

Signed-off-by: YiPeng Chai <YiPeng.Chai@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2023-06-09 10:38:19 -04:00
Tao Zhou f464c5dd4d drm/amdgpu: add check for RAS instance mask
The mask is only needed to be set when RAS block instance number is
more than 1 and invalid bits should be also masked out.
We only check valid bits for GFX and SDMA block for now, and will
add check for other RAS blocks in the future.

v2: move the check under injection operation since the mask is only
    used by RAS error inject.
v3: add valid bits handling for SDMA.
v4: print message if the mask is adjusted.

Signed-off-by: Tao Zhou <tao.zhou1@amd.com>
Hawking Zhang <Hawking.Zhang@amd.com>
Reviewed-by: Stanley.Yang <Stanley.Yang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2023-06-09 10:37:18 -04:00
Tao Zhou 27c5f29526 drm/amdgpu: reorganize RAS injection flow
So GFX RAS injection could use default function if it doesn't define its
own injection interface.

Signed-off-by: Tao Zhou <tao.zhou1@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Reviewed-by: Stanley.Yang <Stanley.Yang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2023-06-09 10:37:13 -04:00
Tao Zhou 2c22ed0bdb drm/amdgpu: add instance mask for RAS inject
User can specify injected instances by the mask. For backward
compatibility, the mask value is incorporated into sub block index
without interface change of RAS TA.
User uses logical mask and driver should convert it to physical value
before sending it to RAS TA.

v2: update parameter name.

Signed-off-by: Tao Zhou <tao.zhou1@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Reviewed-by: Stanley.Yang <Stanley.Yang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2023-06-09 10:37:10 -04:00
Hawking Zhang 9b337b7d62 drm/amdgpu: Adjust the sequence to query ras error info
It turns out STATUS_VALID_FLAG needs to be checked
ahead of any other fields. ADDRESS_VALID_FLAG and
ERR_INFO_VALID_FLAG only manages ADDRESS and ERR_INFO
field respectively. driver should continue poll
ERR CNT field even ERR_INFO_VALD_FLAG is not set.

Signed-off-by: Hawking Zhang <Hawking.Zhang@amd.com>
Reviewed-by: Tao Zhou <tao.zhou1@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2023-06-09 09:58:32 -04:00
Hawking Zhang 8107e4996f drm/amdgpu: Enable persistent edc harvesting in APP APU
Persistent edc harvesting is supported in APP APU

Signed-off-by: Hawking Zhang <Hawking.Zhang@amd.com>
Reviewed-by: Tao Zhou <tao.zhou1@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2023-06-09 09:53:45 -04:00
Hawking Zhang e53a3250f7 drm/amdgpu: Add common helper to reset ras error
Add common helper to reset ras error status. It
applies to IP blocks that follow the new ras error
logging register design, and need to write 0 to
reset the error status. For IP blocks that don't
support the new design, please still implement ip
specific helper.

Signed-off-by: Hawking Zhang <Hawking.Zhang@amd.com>
Reviewed-by: Tao Zhou <tao.zhou1@amd.com>
Reviewed-by: Lijo Lazar <lijo.lazar@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2023-06-09 09:52:54 -04:00
Hawking Zhang 322a7e005d drm/amdgpu: Add common helper to query ras error (v2)
Add common helper to query ras error status and
log error information, including memory block id
and erorr count. The helpers are applicable to IP
blocks that follow the new ras error logging design.
For IP blocks that don't support the new design,
please still implement ip specific helper to query
ras error.

v2: optimize struct amdgpu_ras_err_status_reg_entry
and the implementaion in helper (Lijo/Tao)

Signed-off-by: Hawking Zhang <Hawking.Zhang@amd.com>
Reviewed-by: Tao Zhou <tao.zhou1@amd.com>
Reviewed-by: Lijo Lazar <lijo.lazar@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2023-06-09 09:52:50 -04:00
Candice Li 8eba72053c drm/amdgpu: Drop pcie_bif ras check from fatal error handler
Some ASICs support fatal error event but do not
support pcie_bif ras.

Signed-off-by: Candice Li <candice.li@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2023-04-21 08:49:37 -04:00
Jane Jian b805d8d785 Revert "drm/amdgpu: enable ras for mp0 v13_0_10 on SRIOV"
This reverts commit fe120b9f5c.
This patch impacts sriov multi-vf stability

Signed-off-by: Jane Jian <Jane.Jian@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2023-04-14 13:47:49 -04:00
Stanley.Yang 58bc2a9cbf drm/amdgpu: correct ras enabled flag
XGMI RAS should be according to the gmc xgmi physical nodes number,
XGMI RAS should not be enabled if xgmi num_physical_nodes is zero.

Signed-off-by: Stanley.Yang <Stanley.Yang@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2023-04-11 18:03:45 -04:00
Hawking Zhang 9af357bc3e drm/amdgpu: Add fatal error handling in nbio v4_3
GPU will stop working once fatal error is detected.
it will inform driver to do reset to recover from
the fatal error.

v2: squash in logic fix (Srinivasan)
v3: squash in logic fix (Dan)

Signed-off-by: Hawking Zhang <Hawking.Zhang@amd.com>
Reviewed-by: Candice Li <candice.li@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2023-03-31 11:18:32 -04:00
YiPeng Chai fe120b9f5c drm/amdgpu: enable ras for mp0 v13_0_10 on SRIOV
Enable ras for mp0 v13_0_10 on SRIOV.

Signed-off-by: YiPeng Chai <YiPeng.Chai@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2023-03-22 00:58:37 -04:00
Hawking Zhang 370808876b drm/amdgpu: drop ras check at asic level for new blocks
amdgpu_ras_register_ras_block should always be invoked
by ras_sw_init, where driver needs to check ras caps
at ip level, instead of asic level.

Signed-off-by: Hawking Zhang <Hawking.Zhang@amd.com>
Reviewed-by: Stanley Yang <Stanley.Yang@amd.com>
Reviewed-by: Tao Zhou <tao.zhou1@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2023-03-15 18:45:27 -04:00
Hawking Zhang fdc94d3a8c drm/amdgpu: Rework pcie_bif ras sw_init
pcie_bif ras blocks needs to be initialized as early
as possible to handle fatal error detected in hw_init
phase. also align the pcie_bif ras sw_init with other
ras blocks

Signed-off-by: Hawking Zhang <Hawking.Zhang@amd.com>
Reviewed-by: Stanley Yang <Stanley.Yang@amd.com>
Reviewed-by: Tao Zhou <tao.zhou1@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2023-03-15 18:45:27 -04:00
Tao Zhou f3cbe70e21 drm/amdgpu: change default behavior of bad_page_threshold parameter
Ignore ras umc bad page threshold by default, GPU initialization won't
be stopped in this mode.

v2: refine the description of bad_page_threshold.

Signed-off-by: Tao Zhou <tao.zhou1@amd.com>
Reviewed-by: Stanley.Yang <Stanley.Yang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2023-02-23 17:35:59 -05:00
Tao Zhou 4d33e0f134 drm/amdgpu: exclude duplicate pages from UMC RAS UE count
If a UMC bad page is reserved but not freed by an application, the
application may trigger uncorrectable error repeatly by accessing the page.

v2: add specific function to do the check.
v3: remove duplicate pages, calculate new added bad page number.
v4: reuse save_bad_pages to calculate new added bad page number.

Signed-off-by: Tao Zhou <tao.zhou1@amd.com>
Reviewed-by: Stanley.Yang <Stanley.Yang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2023-02-23 17:35:59 -05:00
YiPeng Chai 8f453c51cf drm/amdgpu: Adjust ras support check condition for special asic
[Why]:
     Amdgpu ras uses amdgpu_ras_is_supported to check whether
  the ras block supports the ras function. amdgpu_ras_is_supported
  uses .ras_enabled to determine whether the ras function of the
  block is enabled.
     But for special asic with mem ecc enabled but sram ecc not
  enabled, some ras blocks support poison mode but their ras function
  is not enabled on .ras_enabled, these ras blocks will run abnormally.

[How]:
    If the ras block is not supported on .ras_enabled but the asic
  supports poison mode and the ras block has ras configuration, it
  can be considered that the ras block supports ras function.

Signed-off-by: YiPeng Chai <YiPeng.Chai@amd.com>
Reviewed-by: Tao Zhou <tao.zhou1@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2023-01-17 16:11:51 -05:00
YiPeng Chai 8c305a3fdf drm/amdgpu: Remove unnecessary ras block support check
[Why]:
   For special asic with mem ecc enabled but sram ecc
not enabled, some ras blocks can register their ras
configuration to ras list, but these ras blocks are not
enabled on .ras_enabled, so it can not get ras block
object using amdgpu_ras_get_ras_block.

[How]:
   Remove ras block support check. Even if the ras block
checked is not in the ras list, it will return a null
pointer and will have no effect.

Signed-off-by: YiPeng Chai <YiPeng.Chai@amd.com>
Reviewed-by: Tao Zhou <tao.zhou1@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2023-01-17 16:11:51 -05:00
YiPeng Chai ac7b25d92c drm/amdgpu: Perform gpu reset after gfx finishes processing ras poison consumption on gfx_v11_0_3
Perform gpu reset after gfx finishes processing
ras poison consumption on gfx_v11_0_3.

V2:
 Move gfx poison consumption handler from hw_ops to ip
 function level.

V3:
 Adjust the calling position of amdgpu_gfx_poison_consumation_handler.

V4:
   Since gfx v11_0_3 does not have .hw_ops instance, the .hw_ops null
 pointer check in amdgpu_ras_interrupt_poison_consumption_handler
 needs to be adjusted.

Signed-off-by: YiPeng Chai <YiPeng.Chai@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Reviewed-by: Tao Zhou <tao.zhou1@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2023-01-17 16:11:51 -05:00
Hawking Zhang 4a1c9a444b drm/amdgpu: allow query error counters for specific IP block
amdgpu_ras_block_late_init will be invoked in IP
specific ras_late_init call as a common helper for
all the IP blocks.

However, when amdgpu_ras_block_late_init call
amdgpu_ras_query_error_count to query ras error
counters, amdgpu_ras_query_error_count queries
all the IP blocks that support ras query interface.

This results to wrong error counters cached in
software copies when there are ras errors detected
at time zero or warm reset procedure. i.e., in
sdma_ras_late_init phase, it counts on sdma/mmhub
errors, while, in mmhub_ras_late_init phase, it
still counts on sdma/mmhub errors.

The change updates amdgpu_ras_query_error_count
interface to allow query specific ip error counter.
It introduces a new input parameter: query_info. if
query_info is NULL,  it means query all the IP blocks,
otherwise, only query the ip block specified by
query_info.

Signed-off-by: Hawking Zhang <Hawking.Zhang@amd.com>
Reviewed-by: Tao Zhou <tao.zhou1@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2023-01-05 11:42:14 -05:00
Stanley.Yang c26cd99918 drm/amdgpu: remove enable ras cmd call trace
[Why]
    [   41.285804] RIP: 0010:amdgpu_ras_feature_enable+0x15c/0x310 [amdgpu]
    [   41.285945] Code: 48 89 c1 48 c7 c2 b9 f2 88 c1 48 c7 c0 c0 f2 88 c1 49 8b 3c 24 48 0f 44 d0 48 c7 c6 98 33 80 c1 e8 5f 52 75 d9 e9 fa fe ff ff <0f> 0b e9 66 ff ff ff 48 8b 3d 86 8c 0f da ba 00 04 00 00 be c0 0d
    [   41.285946] RSP: 0018:ffffbccdc72efc90 EFLAGS: 00010246
    [   41.285948] RAX: 0000000000000004 RBX: ffff931897406980 RCX: 0000000000000002
    [   41.285949] RDX: 0000000000000dc0 RSI: 0000000000000002 RDI: ffff931500042b00
    [   41.285950] RBP: ffffbccdc72efcc0 R08: 0000000000000002 R09: ffff931885b87000
    [   41.285951] R10: 0000000000ffff10 R11: 0000000000000001 R12: ffff931893e20000
    [   41.285952] R13: 0000000000000001 R14: ffff931885b87000 R15: 0000000000000000
    [   41.285953] FS:  0000000000000000(0000) GS:ffff931c6f200000(0000) knlGS:0000000000000000
    [   41.285954] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
    [   41.285955] CR2: 000055dd6f532008 CR3: 000000061b010006 CR4: 00000000003706e0
    [   41.285956] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
    [   41.285957] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
    [   41.285958] Call Trace:
    [   41.285959]  <TASK>
    [   41.285963]  ? gfx_v11_0_early_init+0x250/0x250 [amdgpu]
    [   41.286117]  gfx_v11_0_late_init+0x8c/0xb0 [amdgpu]
    [   41.286271]  amdgpu_device_ip_late_init+0x8d/0x3c0 [amdgpu]
    [   41.286401]  amdgpu_device_init.cold+0x1677/0x1fda [amdgpu]
    [   41.286616]  ? pci_bus_read_config_word+0x4a/0x70
    [   41.286621]  ? do_pci_enable_device+0xdb/0x110
    [   41.286625]  amdgpu_driver_load_kms+0x1a/0x160 [amdgpu]
    [   41.286762]  amdgpu_pci_probe+0x18d/0x3a0 [amdgpu]
    [   41.286898]  local_pci_probe+0x4b/0x90
    [   41.286901]  work_for_cpu_fn+0x1a/0x30
    [   41.286903]  process_one_work+0x22b/0x3d0
    [   41.286905]  worker_thread+0x223/0x420
    [   41.286907]  ? process_one_work+0x3d0/0x3d0
    [   41.286908]  kthread+0x12a/0x150
    [   41.286911]  ? set_kthread_struct+0x50/0x50
    [   41.286913]  ret_from_fork+0x22/0x30

[How]
    For specific asic, only mem ecc is enabled, sram ecc is not enabled,
    but it still need to send ras enable cmd to gfx block to support
    poison mode, so add check posion mode.

Signed-off-by: Stanley.Yang <Stanley.Yang@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2023-01-03 16:57:58 -05:00
Tao Zhou 2dd9032beb drm/amdgpu: define RAS query poison mode function
1. no need to query poison mode on SRIOV guest side, host can handle it.
2. define the function to simplify code.

v2: rename amdgpu_ras_poison_mode_query to amdgpu_ras_query_poison_mode.

Signed-off-by: Tao Zhou <tao.zhou1@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2022-12-15 12:18:19 -05:00
Tao Zhou 3189501e6f drm/amdgpu: update VCN/JPEG RAS setting
Support VCN/JPEG RAS in both bare metal and SRIOV environment.

v2: update commit description.

Signed-off-by: Tao Zhou <tao.zhou1@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2022-12-15 12:18:19 -05:00
Tao Zhou 248c9635b8 drm/amdgpu: skip RAS error injection in SRIOV
Injection on guest is not allowed.

v2: return directly in SRIOV environment.

Signed-off-by: Tao Zhou <tao.zhou1@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2022-12-15 12:18:19 -05:00
ye xingchen 2cffcb6679 drm/amdgpu: use sysfs_emit() to instead of scnprintf()
Replace the open-code with sysfs_emit() to simplify the code.

Reviewed-by: Luben Tuikov <luben.tuikov@amd.com>
Signed-off-by: ye xingchen <ye.xingchen@zte.com.cn>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2022-12-01 15:21:34 -05:00
Tao Zhou 07615da1bf drm/amdgpu: enable RAS for VCN/JPEG v4.0
Set support flag for VCN/JPEG 4.0.

Signed-off-by: Tao Zhou <tao.zhou1@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2022-11-17 18:07:58 -05:00
YiPeng Chai 1a11a65d53 drm/amdgpu: Enable mode-1 reset for RAS recovery in fatal error mode
The patch is enabling mode-1 reset for RAS recovery in fatal error mode.

Signed-off-by: YiPeng Chai <YiPeng.Chai@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Reviewed-by: Tao Zhou <tao.zhou1@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2022-11-17 18:07:52 -05:00
Tao Zhou 1ed0e17690 drm/amdgpu: remove ras_error_status parameter for UMC poison handler
Make the code simpler.

Signed-off-by: Tao Zhou <tao.zhou1@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2022-10-27 15:12:08 -04:00
Tao Zhou 24b822928b drm/amdgpu: use page retirement API in MCA notifier
Make the code more readable.

Signed-off-by: Tao Zhou <tao.zhou1@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2022-10-27 15:12:08 -04:00
Hawking Zhang 6c0ca74820 drm/amdgpu: move convert_error_address out of umc_ras
RAS error address translation algorithm is common
across dGPU and A + A platform as along as the SOC
integrates the same generation of UMC IP.

UMC RAS is managed by x86 MCA on A + A platform,
umc_ras in GPU driver is not initialized at all on
A + A platform. In such case, any umc_ras callback
implemented for dGPU config shouldn't be invoked
from A + A specific callback.

The change moves convert_error_address out of dGPU
umc_ras structure and makes it share between A + A
and dGPU config.

Signed-off-by: Hawking Zhang <Hawking.Zhang@amd.com>
Reviewed-by: Stanley Yang <Stanley.Yang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2022-10-17 17:41:21 -04:00
YiPeng Chai 82835055c6 drm/amdgpu: Add sriov vf ras support in amdgpu_ras_asic_supported
V2:
Add sriov vf ras support in amdgpu_ras_asic_supported.

Signed-off-by: YiPeng Chai <YiPeng.Chai@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2022-10-17 17:41:20 -04:00
YiPeng Chai 073285efde drm/amdgpu: Enable ras support for mp0 v13_0_0 and v13_0_10
V1:
Enable ras support for CHIP_IP_DISCOVERY asic type.

V2:
1. Change commit comment.
2. Enable ras support for mp0 v13_0_0 and v13_0_10.

Signed-off-by: YiPeng Chai <YiPeng.Chai@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2022-10-17 17:41:20 -04:00
Victor Zhao b98a1648d6 Revert "drm/amdgpu: let mode2 reset fallback to default when failure"
This reverts commit dac6b80818.

This commit reverted the AMDGPU_SKIP_MODE2_RESET as it conflicts with
the original design of reset handler. Will redesign it.

Fixes: dac6b80818 ("drm/amdgpu: let mode2 reset fallback to default when failure")
Signed-off-by: Victor Zhao <Victor.Zhao@amd.com>
Reviewed-by: Lijo Lazar <lijo.lazar@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2022-10-17 17:41:20 -04:00
Tao Zhou 38dbbfa57c drm/amdgpu: fix coding style issue for mca notifier
Fix some issues found by checkpatch script.

Signed-off-by: Tao Zhou <tao.zhou1@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2022-10-11 11:05:29 -04:00
Tao Zhou 44420ac5f8 drm/amdgpu: define RAS convert_error_address API
Make the code reusable and remove redundant code.

Signed-off-by: Tao Zhou <tao.zhou1@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2022-10-11 11:05:17 -04:00
Tao Zhou cd4c99f103 drm/amdgpu: use RAS error address convert api in mca notifier
Use the convert interface to simplify code.

Signed-off-by: Tao Zhou <tao.zhou1@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2022-09-29 09:42:31 -04:00
YiPeng Chai 642c040113 drm/amdgpu: Fixed ras warning when uninstalling amdgpu
For the asic using smu v13_0_2, there is the following
warning when uninstalling amdgpu:
  amdgpu: ras disable gfx failed poison:1 ret:-22.

[Why]:
  For the asic using smu v13_0_2, the psp .suspend and
  mode1reset is called before amdgpu_ras_pre_fini during
  amdgpu uninstall, it has disabled all ras features and
  reset the psp. Since the psp is reset, calling
  amdgpu_ras_disable_all_features in amdgpu_ras_pre_fini
  to disable ras features will fail.

[How]:
  If all ras features are disabled, amdgpu_ras_disable_all_features
  will not be called to disable all ras features again.

Signed-off-by: YiPeng Chai <YiPeng.Chai@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2022-09-29 09:41:42 -04:00
Candice Li 6da15a236c drm/amdgpu: Skip reset error status for psp v13_0_0
No need to reset error status since only umc ras supported on psp v13_0_0.

Signed-off-by: Candice Li <candice.li@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2022-09-13 14:32:59 -04:00
Victor Zhao dac6b80818 drm/amdgpu: let mode2 reset fallback to default when failure
- introduce AMDGPU_SKIP_MODE2_RESET flag
- let mode2 reset fallback to default reset method if failed

v2: move this part out from the asic specific part

Signed-off-by: Victor Zhao <Victor.Zhao@amd.com>
Acked-by: Andrey Grodzovsky <andrey.grodzovsky@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2022-08-16 18:14:31 -04:00
Likun Gao f1549c09c5 drm/amdgpu: support reset flag set for gpu reset
Move reset_context out of gpu recover function to make it configurable
for different reset purpose.
For the reset way of call gpu_recovery sysfs, force to use full reset
method. Otherwise, try soft reset by default if the related ASIC
supportted, if soft reset failed, will use full reset.

Signed-off-by: Likun Gao <Likun.Gao@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2022-07-13 11:25:17 -04:00
Stanley.Yang e0e146d556 drm/amdgpu: skip whole ras bad page framework on sriov
It should not init whole ras bad page framework on sriov guest side
due to it is handled on host side.

Signed-off-by: Stanley.Yang <Stanley.Yang@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Reviewed-by: Tao Zhou <tao.zhou1@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2022-07-05 16:10:20 -04:00
Stanley.Yang 26093ce14b drm/amdgpu: Only send ras feature for gfx block
GFX is the only IP block that RAS TA needs to program
the hardware when receiving enable_feature command.

Changed from V1:
    remove amdgpu_ras_need_send_ras_feature inline function,
    use GFX RAS block check directly.

Signed-off-by: Stanley.Yang <Stanley.Yang@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2022-07-05 16:10:13 -04:00
Andrey Grodzovsky cf72704414 drm/amdgpu: Rename amdgpu_device_gpu_recover_imp back to amdgpu_device_gpu_recover
We removed the wrapper that was queueing the recover function
into reset domain queue who was using this name.

Signed-off-by: Andrey Grodzovsky <andrey.grodzovsky@amd.com>
Reviewed-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2022-06-10 15:26:12 -04:00
Andrey Grodzovsky 25a2b22e41 drm/admgpu: Serialize RAS recovery work directly into reset domain queue.
Save the extra usless work schedule.

Signed-off-by: Andrey Grodzovsky <andrey.grodzovsky@amd.com>
Acked-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2022-06-10 15:25:42 -04:00
Candice Li 2a46096335 drm/amdgpu: Resolve RAS GFX error count issue after cold boot on Arcturus
Adjust the sequence for ras late init and separate ras reset error status
from query status.

v2: squash in fix from Candice

Signed-off-by: Candice Li <candice.li@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2022-06-01 15:58:45 -04:00
Stanley.Yang 28caf8c467 drm/amdgpu: fix ras supported check
Fix aldebaran ras supported check on SRIOV guest side,
the previous check conditicon block all ras feature
on baremetal

Signed-off-by: Stanley.Yang <Stanley.Yang@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2022-06-01 15:58:09 -04:00
Stanley.Yang 950d64250f drm/amdgpu: support ras on SRIOV
support umc/gfx/sdma ras on guest side

Changed from V1:
    move sriov judgment in amdgpu_ras_interrupt_fatal_error_handler

Signed-off-by: Stanley.Yang <Stanley.Yang@amd.com>
Reviewed-by: Tao Zhou <tao.zhou1@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2022-05-26 14:56:32 -04:00
Tao Zhou b63ac5d303 drm/amdgpu: refine RAS poison consumption handler
Qeury ras status before ras poison consumption handling, add more
comment and log.

Signed-off-by: Tao Zhou <tao.zhou1@amd.com>
Reviewed-and-tested-by: Mohammad Zafar Ziya <Mohammadzafar.ziya@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2022-05-10 17:53:12 -04:00
Tao Zhou 3678060682 drm/amdgpu: enable RAS IH for poison consumption
Enable RAS IH if poison consumption handler is implemented.

Signed-off-by: Tao Zhou <tao.zhou1@amd.com>
Reviewed-by: Mohammad Zafar Ziya <Mohammadzafar.ziya@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2022-05-10 17:53:12 -04:00
Haowen Bai b3ef3205bc drm/amdgpu: Remove useless kfree
After alloc fail, we do not need to kfree.

Signed-off-by: Haowen Bai <baihaowen@meizu.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2022-04-25 17:05:38 -04:00
Tao Zhou b3c76814ce drm/amdgpu: add RAS fatal error interrupt handler
The fatal error handler is independent from general ras interrupt
handler since there is no related IH ring.

Signed-off-by: Tao Zhou <tao.zhou1@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2022-04-22 14:50:18 -04:00
Tao Zhou 66f8794961 drm/amdgpu: add RAS poison consumption handler (v2)
Add support for general RAS poison consumption handler.

v2: remove callback function for poison consumption.

Signed-off-by: Tao Zhou <tao.zhou1@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2022-04-22 14:50:13 -04:00
Tao Zhou 50a7d025ca drm/amdgpu: add RAS poison creation handler (v2)
Prepare for the implementation of poison consumption handler.

v2: separate umc handler from poison creation.

Signed-off-by: Tao Zhou <tao.zhou1@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2022-04-22 14:50:07 -04:00
Mohammad Zafar Ziya a3d63c62bd drm/amdgpu: Add vcn and jpeg ras support flag
Add vcn and jpeg ras support options

V2: vcn and jpeg ras flag enabled for aldebaran asic only

V3: vcn and jpeg ras flag disabled for error counter query
Generic poison query interface added
VCN and JPEG ras enabled based on IP version check

V4: vcn and jpeg ras flag moved under ecc flag for dGPU

Signed-off-by: Mohammad Zafar Ziya <Mohammadzafar.ziya@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Reviewed-by: Tao Zhou <tao.zhou1@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2022-03-28 12:54:39 -04:00
Stanley.Yang 69691c8235 drm/amdgpu: message smu to update bad channel info
It should notice SMU to update bad channel info when detected
uncorrectable error in UMC block

Signed-off-by: Stanley.Yang <Stanley.Yang@amd.com>
Reviewed-by: Tao Zhou <tao.zhou1@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2022-03-15 14:25:16 -04:00
yipechai 80e0c2cb37 drm/amdgpu: Remove redundant .ras_fini initialization in some ras blocks
1. Define amdgpu_ras_block_late_fini_default in amdgpu_ras.c as
   .ras_fini common function, which is called when
   .ras_fini of ras block isn't initialized.
2. Remove the code of using amdgpu_ras_block_late_fini to
   initialize .ras_fini in ras blocks.

Signed-off-by: yipechai <YiPeng.Chai@amd.com>
Reviewed-by: Tao Zhou <tao.zhou1@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2022-03-02 18:40:06 -05:00
yipechai 1f211a827c drm/amdgpu: centrally calls the .ras_fini function of all ras blocks
centrally calls the .ras_fini function of all ras blocks.

Signed-off-by: yipechai <YiPeng.Chai@amd.com>
Reviewed-by: Tao Zhou <tao.zhou1@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2022-03-02 18:40:05 -05:00
yipechai 29c9b6cd58 drm/amdgpu: Fixed warning reported by kernel test robot
Fixed warning reported by kernel test robot:
1.warning: variable 'ras_obj' is used uninitialized
  whenever '||' condition is true.

Signed-off-by: yipechai <YiPeng.Chai@amd.com>
Reviewed-by: Tao Zhou <tao.zhou1@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2022-02-23 14:26:35 -05:00
Maíra Canal d41ff22a4e drm/amdgpu: Change amdgpu_ras_block_late_init_default function scope
Turn previously global function into a static function to avoid the
following Clang warning:

drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c:2459:5: warning: no previous prototype
for function 'amdgpu_ras_block_late_init_default' [-Wmissing-prototypes]
int amdgpu_ras_block_late_init_default(struct amdgpu_device *adev,
    ^
drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c:2459:1: note: declare 'static' if the
function is not intended to be used outside of this translation unit
int amdgpu_ras_block_late_init_default(struct amdgpu_device *adev,
^
static

Signed-off-by: Maíra Canal <maira.canal@usp.br>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2022-02-23 14:02:51 -05:00
Tom Rix 779596ce6a drm/amdgpu: fix amdgpu_ras_block_late_init error handler
Clang build fails with
amdgpu_ras.c:2416:7: error: variable 'ras_obj' is used uninitialized
  whenever 'if' condition is true
  if (adev->in_suspend || amdgpu_in_reset(adev)) {
  ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

amdgpu_ras.c:2453:6: note: uninitialized use occurs here
 if (ras_obj->ras_cb)
     ^~~~~~~

There is a logic error in the error handler's labels.
ex/ The sysfs: is the last goto label in the normal code but
is the middle of error handler.  Rework the error handler.

cleanup: is the first error, so it's handler should be last.

interrupt: is the second error, it's handler is next.  interrupt:
handles the failure of amdgpu_ras_interrupt_add_hander() by
calling amdgpu_ras_interrupt_remove_handler().  This is wrong,
remove() assumes the interrupt has been setup, not torn down by
add().  Change the goto label to cleanup.

sysfs is the last error, it's handler should be first.  sysfs:
handles the failure of amdgpu_ras_sysfs_create() by calling
amdgpu_ras_sysfs_remove().  But when the create() fails there
is nothing added so there is nothing to remove.  This error
handler is not needed. Remove the error handler and change
goto label to interrupt.

Fixes: b293e891b0 ("drm/amdgpu: add helper function to do common ras_late_init/fini (v3)")
Reviewed-by: Luben Tuikov <luben.tuikov@amd.com>
Signed-off-by: Tom Rix <trix@redhat.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2022-02-17 15:59:05 -05:00
yipechai 418abce203 drm/amdgpu: Remove redundant .ras_late_init initialization in some ras blocks
1. Define amdgpu_ras_block_late_init_default in amdgpu_ras.c as
   .ras_late_init common function, which is called when
   .ras_late_init of ras block isn't initialized.
2. Remove the code of using amdgpu_ras_block_late_init to
   initialize .ras_late_init in ras blocks.

Signed-off-by: yipechai <YiPeng.Chai@amd.com>
Reviewed-by: Tao Zhou <tao.zhou1@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2022-02-17 15:59:05 -05:00
yipechai 867e24ca49 drm/amdgpu: define amdgpu_ras_late_init to call all ras blocks' .ras_late_init
Define amdgpu_ras_late_init to call all ras blocks' .ras_late_init.

Signed-off-by: yipechai <YiPeng.Chai@amd.com>
Reviewed-by: Tao Zhou <tao.zhou1@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2022-02-17 15:59:05 -05:00
yipechai 563285c85e drm/amdgpu: Merge amdgpu_ras_late_init/amdgpu_ras_late_fini to amdgpu_ras_block_late_init/amdgpu_ras_block_late_fini
1. Merge amdgpu_ras_late_init to
   amdgpu_ras_block_late_init.
2. Remove amdgpu_ras_late_init since no ras block
   calls amdgpu_ras_late_init.
3. Merge amdgpu_ras_late_fini to
   amdgpu_ras_block_late_fini.
4. Remove amdgpu_ras_late_fini since no ras block
   calls amdgpu_ras_late_fini.

Signed-off-by: yipechai <YiPeng.Chai@amd.com>
Reviewed-by: Tao Zhou <tao.zhou1@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2022-02-14 15:08:41 -05:00
yipechai 9252d33df5 drm/amdgpu: Optimize operating sysfs and interrupt function interface in amdgpu_ras.c
In order to reduce redundant struct conversion, modify
operating sysfs and interrupt function interface parameters.

Signed-off-by: yipechai <YiPeng.Chai@amd.com>
Reviewed-by: Tao Zhou <tao.zhou1@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2022-02-14 15:08:41 -05:00
yipechai 80ed77f971 drm/amdgpu: Optimize amdgpu_nbio_ras_late_init/amdgpu_nbio_ras_fini function code
Optimize amdgpu_nbio_ras_late_init/amdgpu_nbio_ras_fini function code.

Signed-off-by: yipechai <YiPeng.Chai@amd.com>
Reviewed-by: Tao Zhou <tao.zhou1@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2022-02-14 15:08:41 -05:00
yipechai bdb3489cfc drm/amdgpu: Optimize xxx_ras_late_init/xxx_ras_late_fini for each ras block
1. Define amdgpu_ras_block_late_init to create sysfs nodes
   and interrupt handles.
2. Define amdgpu_ras_block_late_fini to remove sysfs nodes
   and interrupt handles.
3. Replace ras block variable members in struct
   amdgpu_ras_block_object with struct ras_common_if, which
   can make it easy to associate each ras block instance
   with each ras block functional interface.
4. Add .ras_cb to struct amdgpu_ras_block_object.
5. Change each ras block to fit for the changement of struct
   amdgpu_ras_block_object.

Signed-off-by: yipechai <YiPeng.Chai@amd.com>
Reviewed-by: Tao Zhou <tao.zhou1@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2022-02-14 15:08:40 -05:00
Dave Airlie 123db17ddf Merge tag 'amd-drm-next-5.18-2022-02-11-1' of https://gitlab.freedesktop.org/agd5f/linux into drm-next
amd-drm-next-5.18-2022-02-11-1:

amdgpu:
- Clean up of power management code
- Enable freesync video mode by default
- Clean up of RAS code
- Improve VRAM access for debug using SDMA
- Coding style cleanups
- SR-IOV fixes
- More display FP reorg
- TLB flush fixes for Arcuturus, Vega20
- Misc display fixes
- Rework special register access methods for SR-IOV
- DP2 fixes
- DP tunneling fixes
- DSC fixes
- More IP discovery cleanups
- Misc RAS fixes
- Enable both SMU i2c buses where applicable
- s2idle improvements
- DPCS header cleanup
- Add new CAP firmware support for SR-IOV

amdkfd:
- Misc cleanups
- SVM fixes
- CRIU support
- Clean up MQD manager

UAPI:
- Add interface to amdgpu CTX ioctl to request a stable power state for profiling
  https://gitlab.freedesktop.org/mesa/drm/-/merge_requests/207
- Add amdkfd support for CRIU
  https://github.com/checkpoint-restore/criu/pull/1709
- Remove old unused amdkfd debugger interface
  Was only implemented for Kaveri and was only ever used by an old HSA tool that was never open sourced

radeon:
- Fix error handling in radeon_driver_open_kms
- UVD suspend fix
- Misc fixes

From: Alex Deucher <alexander.deucher@amd.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20220211220706.5803-1-alexander.deucher@amd.com
2022-02-14 10:31:51 +10:00
yipechai a50b048276 Revert "drm/amdgpu: Add judgement to avoid infinite loop"
The commit d5e8ff5f7b ("drm/amdgpu: Fixed the defect of soft lock caused by infinite loop")
had fixed this defect.

Revert workaround
commit a2170b4af6 ("drm/amdgpu: Add judgement to avoid infinite loop").

Signed-off-by: yipechai <YiPeng.Chai@amd.com>
Reviewed-by: Tao Zhou <tao.zhou1@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2022-02-07 18:00:05 -05:00
yipechai d5e8ff5f7b drm/amdgpu: Fixed the defect of soft lock caused by infinite loop
1. The infinite loop case only occurs on multiple cards support
   ras functions.
2. The explanation of root cause refer to commit 76641cbbf196
   ("drm/amdgpu: Add judgement to avoid infinite loop").
3. Create new node to manage each unique ras instance to guarantee
   each device .ras_list is completely independent.
4. Fixes: commit 7a6b8ab3231b51 ("drm/amdgpu: Unify ras block
   interface for each ras block").
5. The soft locked logs are as follows:
[  262.165690] CPU: 93 PID: 758 Comm: kworker/93:1 Tainted: G           OE     5.13.0-27-generic #29~20.04.1-Ubuntu
[  262.165695] Hardware name: Supermicro AS -4124GS-TNR/H12DSG-O-CPU, BIOS T20200717143848 07/17/2020
[  262.165698] Workqueue: events amdgpu_ras_do_recovery [amdgpu]
[  262.165980] RIP: 0010:amdgpu_ras_get_ras_block+0x86/0xd0 [amdgpu]
[  262.166239] Code: 68 d8 4c 8d 71 d8 48 39 c3 74 54 49 8b 45 38 48 85 c0 74 32 44 89 fa 44 89 e6 4c 89 ef e8 82 e4 9b dc 85 c0 74 3c 49 8b 46 28 <49> 8d 56 28 4d 89 f5 48 83 e8 28 48 39 d3 74 25 49 89 c6 49 8b 45
[  262.166243] RSP: 0018:ffffac908fa87d80 EFLAGS: 00000202
[  262.166247] RAX: ffffffffc1394248 RBX: ffff91e4ab8d6e20 RCX: ffffffffc1394248
[  262.166249] RDX: ffff91e4aa356e20 RSI: 000000000000000e RDI: ffff91e4ab8c0000
[  262.166252] RBP: ffffac908fa87da8 R08: 0000000000000007 R09: 0000000000000001
[  262.166254] R10: ffff91e4930b64ec R11: 0000000000000000 R12: 000000000000000e
[  262.166256] R13: ffff91e4aa356df8 R14: ffffffffc1394320 R15: 0000000000000003
[  262.166258] FS:  0000000000000000(0000) GS:ffff92238fb40000(0000) knlGS:0000000000000000
[  262.166261] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  262.166264] CR2: 00000001004865d0 CR3: 000000406d796000 CR4: 0000000000350ee0
[  262.166267] Call Trace:
[  262.166272]  amdgpu_ras_do_recovery+0x130/0x290 [amdgpu]
[  262.166529]  ? psi_task_switch+0xd2/0x250
[  262.166537]  ? __switch_to+0x11d/0x460
[  262.166542]  ? __switch_to_asm+0x36/0x70
[  262.166549]  process_one_work+0x220/0x3c0
[  262.166556]  worker_thread+0x4d/0x3f0
[  262.166560]  ? process_one_work+0x3c0/0x3c0
[  262.166563]  kthread+0x12b/0x150
[  262.166568]  ? set_kthread_struct+0x40/0x40
[  262.166571]  ret_from_fork+0x22/0x30

Signed-off-by: yipechai <YiPeng.Chai@amd.com>
Reviewed-by: Tao Zhou <tao.zhou1@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2022-02-07 17:59:53 -05:00
Luben Tuikov afa3731591 drm/amdgpu: Print once if RAS unsupported
MESA polls for errors every 2-3 seconds. Printing with dev_info() causes
the dmesg log to fill up with the same message, e.g,

[18028.206676] amdgpu 0000:0b:00.0: amdgpu: df doesn't config ras function.

Make it dev_dbg_once(), as it isn't something correctible during boot or
thereafter, so printing just once is sufficient. Also sanitize the message.

Cc: Alex Deucher <Alexander.Deucher@amd.com>
Cc: Hawking Zhang <Hawking.Zhang@amd.com>
Cc: John Clements <john.clements@amd.com>
Cc: Tao Zhou <tao.zhou1@amd.com>
Cc: yipechai <YiPeng.Chai@amd.com>
Fixes: 8b0fb0e967 ("drm/amdgpu: Modify gfx block to fit for the unified ras block data and ops")
Signed-off-by: Luben Tuikov <luben.tuikov@amd.com>
Reviewed-by: Alex Deucher <Alexander.Deucher@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2022-02-07 17:14:16 -05:00
yipechai a2170b4af6 drm/amdgpu: Add judgement to avoid infinite loop
1. The infinite loop causing soft lock occurs on multiple amdgpu cards
   supporting ras feature.
2. This a workaround patch to fix 6492e1b07c.
   It is valid for multiple amdgpu cards of the same type.
3. The root cause is that each GPU card device has a separate .ras_list
   link header, but the instance and linked list node of each ras block
   are unique. When each device is initialized, each ras instance will
   repeatedly add link node to the device every time. In this way, only
   the .ras_list of the last initialized device is completely correct.
   the .ras_list->prev and .ras_list->next of the device initialzied
   before can still point to the correct ras instance, but the prev
   pointer and next pointer of the pointed ras instance both point to
   the last initialized device's .ras_ list instead of the beginning
   .ras_ list. When using list_for_each_entry_safe searches for
   non-existent Ras nodes on devices other than the last device, the
   last ras instance next pointer cannot always be equal to the
   beginning .ras_list, so that the loop cannot be terminated, the
   program enters a infinite loop.
 BTW: Since the data and initialization process of each card are the same,
      the link list between ras instances will not be destroyed every time
      the device is initialized.
 4. The soft locked logs are as follows:
[  262.165690] CPU: 93 PID: 758 Comm: kworker/93:1 Tainted: G           OE     5.13.0-27-generic #29~20.04.1-Ubuntu
[  262.165695] Hardware name: Supermicro AS -4124GS-TNR/H12DSG-O-CPU, BIOS T20200717143848 07/17/2020
[  262.165698] Workqueue: events amdgpu_ras_do_recovery [amdgpu]
[  262.165980] RIP: 0010:amdgpu_ras_get_ras_block+0x86/0xd0 [amdgpu]
[  262.166239] Code: 68 d8 4c 8d 71 d8 48 39 c3 74 54 49 8b 45 38 48 85 c0 74 32 44 89 fa 44 89 e6 4c 89 ef e8 82 e4 9b dc 85 c0 74 3c 49 8b 46 28 <49> 8d 56 28 4d 89 f5 48 83 e8 28 48 39 d3 74 25 49 89 c6 49 8b 45
[  262.166243] RSP: 0018:ffffac908fa87d80 EFLAGS: 00000202
[  262.166247] RAX: ffffffffc1394248 RBX: ffff91e4ab8d6e20 RCX: ffffffffc1394248
[  262.166249] RDX: ffff91e4aa356e20 RSI: 000000000000000e RDI: ffff91e4ab8c0000
[  262.166252] RBP: ffffac908fa87da8 R08: 0000000000000007 R09: 0000000000000001
[  262.166254] R10: ffff91e4930b64ec R11: 0000000000000000 R12: 000000000000000e
[  262.166256] R13: ffff91e4aa356df8 R14: ffffffffc1394320 R15: 0000000000000003
[  262.166258] FS:  0000000000000000(0000) GS:ffff92238fb40000(0000) knlGS:0000000000000000
[  262.166261] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  262.166264] CR2: 00000001004865d0 CR3: 000000406d796000 CR4: 0000000000350ee0
[  262.166267] Call Trace:
[  262.166272]  amdgpu_ras_do_recovery+0x130/0x290 [amdgpu]
[  262.166529]  ? psi_task_switch+0xd2/0x250
[  262.166537]  ? __switch_to+0x11d/0x460
[  262.166542]  ? __switch_to_asm+0x36/0x70
[  262.166549]  process_one_work+0x220/0x3c0
[  262.166556]  worker_thread+0x4d/0x3f0
[  262.166560]  ? process_one_work+0x3c0/0x3c0
[  262.166563]  kthread+0x12b/0x150
[  262.166568]  ? set_kthread_struct+0x40/0x40
[  262.166571]  ret_from_fork+0x22/0x30

Fixes: 6492e1b07c ("drm/amdgpu: Unify ras block interface for each ras block")
Signed-off-by: yipechai <YiPeng.Chai@amd.com>
Reviewed-by: John Clements <john.clements@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2022-02-02 18:26:31 -05:00
Tao Zhou 400013b268 drm/amdgpu: add umc_fill_error_record to make code more simple
Create common amdgpu_umc_fill_error_record function for all versions
of UMC and clean up related codes.

Signed-off-by: Tao Zhou <tao.zhou1@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2022-01-27 15:48:56 -05:00
yipechai e9287ef8d4 Revert "drm/amdgpu: No longer insert ras blocks into ras_list if it already exists in ras_list"
This reverts commit df4f0041c6.

Xgmi ras initialization had been moved from .late_init to early_init,
the defect of repeated calling amdgpu_ras_register_ras_block had been
fixed, so revert this patch.

Signed-off-by: yipechai <YiPeng.Chai@amd.com>
Reviewed-by: Tao Zhou <tao.zhou1@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2022-01-25 18:00:34 -05:00
yipechai 8eb53bb2aa drm/amdgpu: Remove repeated calls
Remove repeated calls.

Signed-off-by: yipechai <YiPeng.Chai@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Reviewed-by: Tao Zhou <tao.zhou1@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2022-01-19 22:32:19 -05:00
yipechai b6efdb02d2 drm/amdgpu: Fix the code style warnings in amdgpu_ras
Fix the code style warnings in amdgpu_ras:
1. ERROR: space required before the open parenthesis '('.
2. WARNING: line length of xxx exceeds 100 columns.
3. ERROR: "foo* bar" should be "foo *bar".
4. WARNING: unnecessary whitespace before a quoted newline.
5. WARNING: space prohibited before semicolon.
6. WARNING: suspect code indent for conditional statements.
7. WARNING: braces {} are not necessary for single statement blocks.

Signed-off-by: yipechai <YiPeng.Chai@amd.com>
Reviewed-by: Tao Zhou <tao.zhou1@amd.com>
Acked-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2022-01-18 17:23:58 -05:00
Stanley.Yang 79c0462159 drm/amdgpu: handle denied inject error into critical regions v2
Changed from v1:
    remove unused brace

Signed-off-by: Stanley.Yang <Stanley.Yang@amd.com>
Reviewed-by: Tao Zhou <tao.zhou1@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2022-01-18 17:22:36 -05:00
Linus Torvalds 59d41458f1 drm fixes for 5.17-rc1:
drivers fixes:
 - i915 fixes for ttm backend + one pm wakelock fix
 - amdgpu fixes, fairly big pile of small things all over. Note this
   doesn't yet containe the fixed version of the otg sync patch that
   blew up
 - small driver fixes: meson, sun4i, vga16fb probe fix
 
 drm core fixes:
 - cma-buf heap locking
 - ttm compilation
 - self refresh helper state check
 - wrong error message in atomic helpers
 - mipi-dbi buffer mapping
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEEb4nG6jLu8Y5XI+PfTA9ye/CYqnEFAmHhrxUACgkQTA9ye/CY
 qnEowxAAgBPwGEobRGMbR3Me98vEKvcWqSxBe/k1VC4LhO5DrvbG5iW9cuxCJZM2
 wlGlGAtU7C7pcCP5Xp1UlMqZ5a0rSVhqMPPkMKO9+7033ofSlAQatnMI1EENH6Hn
 BkhXwTyuOBSN6zqskg8FKqzF+VPTt5ZV2U5qJzQweP/wFtZPAKI4tWE4oKiHactH
 fJHnAi7T6ytF6a7J21BsSEluk4z7BjmcmFF0tW6iuq7Y6TXDFXFq9QFDR041b2rI
 GYDUXl2mebp/L+2M3sPYuMiIiyJ8enh7crNIdmi+EstmzRADa7RMjnY3j2tJg/7M
 pqnuJZAVcpkCurb7NMr3ycmrxnhfUsZfbuXvm+k5yJYfQCaGNiKy1ObsFWH9zBDz
 XMuxcE+csSaX/7rjoyXrL2ZTRPXnVwJNJ8x1CuKn3giLxMSqnPnDMjyHmNLB8qa1
 R0wbPQbdx5+jWgs/ngUGFNo4vFBnNmqQP4G3LaWJ/Ku5cSrEM+Jt9GJOw5jjVgun
 gaOlKlUpMBlKmXOwkvhRW2AwHRcL7lrBuIw0oFOThdMzkSNlKSNBMpAkgvD/9C7g
 IDtJgA7a3MqzLjhPLmhUB3rFyXz5dg5rNZhoH8z4DFiJVpTKYYA5/UoU/RTXZGqW
 yFdrBkFmNJeKTZZAe+e/4OP/dm1vKVEKK6Ko/M9CrELzBKSussg=
 =h74X
 -----END PGP SIGNATURE-----

Merge tag 'drm-next-2022-01-14' of git://anongit.freedesktop.org/drm/drm

Pull drm fixes from Daniel Vetter:
"drivers fixes:

   - i915 fixes for ttm backend + one pm wakelock fix

   - amdgpu fixes, fairly big pile of small things all over. Note this
     doesn't yet containe the fixed version of the otg sync patch that
     blew up

   - small driver fixes: meson, sun4i, vga16fb probe fix

  drm core fixes:

   - cma-buf heap locking

   - ttm compilation

   - self refresh helper state check

   - wrong error message in atomic helpers

   - mipi-dbi buffer mapping"

* tag 'drm-next-2022-01-14' of git://anongit.freedesktop.org/drm/drm: (49 commits)
  drm/mipi-dbi: Fix source-buffer address in mipi_dbi_buf_copy
  drm: fix error found in some cases after the patch d1af5cd86997
  drm/ttm: fix compilation on ARCH=um
  dma-buf: cma_heap: Fix mutex locking section
  video: vga16fb: Only probe for EGA and VGA 16 color graphic cards
  drm/amdkfd: Fix ASIC name typos
  drm/amdkfd: Fix DQM asserts on Hawaii
  drm/amdgpu: Use correct VIEWPORT_DIMENSION for DCN2
  drm/amd/pm: only send GmiPwrDnControl msg on master die (v3)
  drm/amdgpu: use spin_lock_irqsave to avoid deadlock by local interrupt
  drm/amdgpu: not return error on the init_apu_flags
  drm/amdkfd: Use prange->update_list head for remove_list
  drm/amdkfd: Use prange->list head for insert_list
  drm/amdkfd: make SPDX License expression more sound
  drm/amdkfd: Check for null pointer after calling kmemdup
  drm/amd/display: invalid parameter check in dmub_hpd_callback
  Revert "drm/amdgpu: Don't inherit GEM object VMAs in child process"
  drm/amd/display: reset dcn31 SMU mailbox on failures
  drm/amdkfd: use default_groups in kobj_type
  drm/amdgpu: use default_groups in kobj_type
  ...
2022-01-16 06:52:38 +02:00
yipechai e3d833f41c drm/amdgpu: fix compile warning for ras_block_match_default
fix compile warning for ras_block_match_default

Signed-off-by: yipechai <YiPeng.Chai@amd.com>
Reviewed-by: Tao Zhou <tao.zhou1@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2022-01-14 17:52:44 -05:00
yipechai 954ea6aa15 drm/amdgpu: Use ARRAY_SIZE to get array length
Use ARRAY_SIZE to get array length.

Signed-off-by: yipechai <YiPeng.Chai@amd.com>
Reviewed-by: Tao Zhou <tao.zhou1@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2022-01-14 17:52:26 -05:00
Yang Li ab3b9de65b drm/amdgpu: clean up some inconsistent indenting
Eliminate the follow smatch warnings:
drivers/gpu/drm/amd/amdgpu/amdgpu_device.c:3504 amdgpu_device_init()
warn: inconsistent indenting
drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c:1716
amdgpu_ras_error_status_query() warn: if statement not indented
drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c:1058 amdgpu_ras_error_inject()
warn: inconsistent indenting

Reported-by: Abaci Robot <abaci@linux.alibaba.com>
Reviewed-by: Guchun Chen <guchun.chen@amd.com>
Signed-off-by: Yang Li <yang.lee@linux.alibaba.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2022-01-14 17:52:10 -05:00
Yang Li 69f91d32c6 drm/amdgpu: remove unneeded semicolon
Eliminate the following coccicheck warning:
./drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c:2725:16-17: Unneeded semicolon

Reported-by: Abaci Robot <abaci@linux.alibaba.com>
Reviewed-by: Guchun Chen <guchun.chen@amd.com>
Signed-off-by: Yang Li <yang.lee@linux.alibaba.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2022-01-14 17:52:07 -05:00
yipechai df4f0041c6 drm/amdgpu: No longer insert ras blocks into ras_list if it already exists in ras_list
No longer insert ras blocks into ras_list if it already exists in ras_list.

Signed-off-by: yipechai <YiPeng.Chai@amd.com>
Reviewed-by: Tao Zhou <tao.zhou1@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2022-01-14 17:52:04 -05:00
yipechai df01fe73ee drm/amdgpu: Add ras supported check for register_ras_block
Add ras supported check for register_ras_block.

Signed-off-by: yipechai <YiPeng.Chai@amd.com>
Reviewed-by: Tao Zhou <tao.zhou1@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2022-01-14 17:52:00 -05:00
yipechai 7389a5b837 drm/amdgpu: Removed redundant ras code
Removed redundant ras code.

Signed-off-by: yipechai <YiPeng.Chai@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Reviewed-by: John Clements <john.clements@amd.com>
Reviewed-by: Tao Zhou <tao.zhou1@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2022-01-14 17:52:00 -05:00
yipechai 22d4ba53b1 drm/amdgpu: Adjust error inject function code style in amdgpu_ras.c
1. Move xgmi special error inject function from amdgpu_ras.c to xgmi block.
2. Support to use psp_ras_trigger_error as default error inject function in amdgpu_ras.c. If .ras_error_inject isn't defined in ras block, default error inject function will take effect.

v2: squash in warning fix (Alex)

Signed-off-by: yipechai <YiPeng.Chai@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Reviewed-by: John Clements <john.clements@amd.com>
Reviewed-by: Tao Zhou <tao.zhou1@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2022-01-14 17:52:00 -05:00
yipechai bdc4292bd3 drm/amdgpu: Modify sdma block to fit for the unified ras block data and ops
1.Modify sdma block to fit for the unified ras block data and ops.
2.Change amdgpu_sdma_ras_funcs to amdgpu_sdma_ras, and the corresponding variable name remove _funcs suffix.
3.Remove the const flag of sdma ras variable so that sdma ras block can be able to be inserted into amdgpu device ras block link list.
4.Invoke amdgpu_ras_register_ras_block function to register sdma ras block into amdgpu device ras block link list.
5.Remove the redundant code about sdma in amdgpu_ras.c after using the unified ras block.
6.Fill unified ras block .name .block .ras_late_init and .ras_fini for all of sdma versions. If .ras_late_init and .ras_fini had been defined by the selected sdma version, the defined functions will take effect; if not defined, default fill them with amdgpu_sdma_ras_late_init and amdgpu_sdma_ras_fini.

v2: squash in warning fix (Alex)

Signed-off-by: yipechai <YiPeng.Chai@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Reviewed-by: John Clements <john.clements@amd.com>
Reviewed-by: Tao Zhou <tao.zhou1@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2022-01-14 17:52:00 -05:00
yipechai efe17d5a21 drm/amdgpu: Modify umc block to fit for the unified ras block data and ops
1.Modify umc block to fit for the unified ras block data and ops.
2.Change amdgpu_umc_ras_funcs to amdgpu_umc_ras, and the corresponding variable name remove _funcs suffix.
3.Remove the const flag of umc ras variable so that umc ras block can be able to be inserted into amdgpu device ras block link list.
4.Invoke amdgpu_ras_register_ras_block function to register umc ras block into amdgpu device ras block link list.
5.Remove the redundant code about umc in amdgpu_ras.c after using the unified ras block.
6.Fill unified ras block .name .block .ras_late_init and .ras_fini for all of umc versions. If .ras_late_init and .ras_fini had been defined by the selected umc version, the defined functions will take effect; if not defined, default fill them with amdgpu_umc_ras_late_init and amdgpu_umc_ras_fini.

Signed-off-by: yipechai <YiPeng.Chai@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Reviewed-by: John Clements <john.clements@amd.com>
Reviewed-by: Tao Zhou <tao.zhou1@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2022-01-14 17:51:59 -05:00
yipechai 2e54fe5d05 drm/amdgpu: Modify nbio block to fit for the unified ras block data and ops
1.Modify nbio block to fit for the unified ras block data and ops.
2.Change amdgpu_nbio_ras_funcs to amdgpu_nbio_ras, and the corresponding variable name remove _funcs suffix.
3.Remove the const flag of mmhub ras variable so that nbio ras block can be able to be inserted into amdgpu device ras block link list.
4.Invoke amdgpu_ras_register_ras_block function to register nbio ras block into amdgpu device ras block link list.
5.Remove the redundant code about nbio in amdgpu_ras.c after using the unified ras block.

Signed-off-by: yipechai <YiPeng.Chai@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Reviewed-by: John Clements <john.clements@amd.com>
Reviewed-by: Tao Zhou <tao.zhou1@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
2022-01-14 17:51:59 -05:00