Video seek (sometimes including file open) crashes amdgpu (GPU reset, xorg terminated) linux 6.10.1 + mpv gpu-next x11egl vaapi x264
I'm filing the bug here since any other affected users would be likely to check mpv rather than linux.
Broken: linux 6.10.1.arch1-1 combined with the current mpv/ffmpeg/etc stack
Working workaround: use linux-lts as mpv devs suggest https://github.com/mpv-player/mpv/issues/14600
Incomplete workaround (other packages break due to unmet deps, use -b or add the other packages on your system.) pacman -U mpv-1:0.38.0-4-x86_64.pkg.tar.zst libplacebo-6.338.2-6-x86_64.pkg.tar.zst ffmpeg-2:6.1.1-7-x86_64.pkg.tar.zst x265-3.5-3-x86_64.pkg.tar.zst ffmpeg4.4-4.4.4-5-x86_64.pkg.tar.zst
Working: linux-lts + current mpv ( mpv 1:0.38.0-6 ) etc
mpv --version mpv v0.38.0-dirty Copyright © 2000-2024 mpv/MPlayer/mplayer2 projects built on Jul 3 2024 05:59:22 libplacebo version: v7.349.0 FFmpeg version: n7.0.1 FFmpeg library versions: libavutil 59.8.100 libavcodec 61.3.100 libavformat 61.1.100 libswscale 8.1.100 libavfilter 10.1.100 libswresample 5.1.100
Steps to reproduce; more than one (but not all), random video files I tested trigger the issue when the Swiss cheese lines up. If you are affected when seeking (sometimes including when initially loading a video file) the video output becomes corrupted (blocks of raw data dumped in squares, anywhere in the framebuffer); followed by a GPU crash and reset.
Upstream points at the kernel alone as the fault, so I read that as: stack traces for mpv seem fully normal to a subject expert.
-- mpv -vo gpu-next crashed plasmashell during these events
[ 1766.321165] amdgpu 0000:01:00.0: amdgpu: GPU fault detected: 147 0x0a22c802
[ 1766.321171] amdgpu 0000:01:00.0: amdgpu: for process plasmashell pid 2961 thread plasmashel:cs0 pid 3007
[ 1766.321172] amdgpu 0000:01:00.0: amdgpu: VM_CONTEXT1_PROTECTION_FAULT_ADDR 0x00101F44
[ 1766.321174] amdgpu 0000:01:00.0: amdgpu: VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x0B0C8002
[ 1766.321175] amdgpu 0000:01:00.0: amdgpu: VM fault (0x02, vmid 5, pasid 32772) at page 1056580, write from 'TC3' (0x54433300) (200)
[ 1766.321237] amdgpu 0000:01:00.0: amdgpu: GPU fault detected: 147 0x07f2a002
[ 1766.321238] amdgpu 0000:01:00.0: amdgpu: for process plasmashell pid 2961 thread plasmashel:cs0 pid 3007
[ 1766.321239] amdgpu 0000:01:00.0: amdgpu: VM_CONTEXT1_PROTECTION_FAULT_ADDR 0x0010120C
[ 1766.321240] amdgpu 0000:01:00.0: amdgpu: VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x0B020002
[ 1766.321241] amdgpu 0000:01:00.0: amdgpu: VM fault (0x02, vmid 5, pasid 32772) at page 1053196, write from 'CB2' (0x43423200) (32)
[ 1766.321244] amdgpu 0000:01:00.0: amdgpu: GPU fault detected: 147 0x07b29002
[ 1766.321245] amdgpu 0000:01:00.0: amdgpu: for process plasmashell pid 2961 thread plasmashel:cs0 pid 3007
[ 1766.321247] amdgpu 0000:01:00.0: amdgpu: VM_CONTEXT1_PROTECTION_FAULT_ADDR 0x00101237
[ 1766.321247] amdgpu 0000:01:00.0: amdgpu: VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x0B010002
[ 1766.321248] amdgpu 0000:01:00.0: amdgpu: VM fault (0x02, vmid 5, pasid 32772) at page 1053239, write from 'CB3' (0x43423300) (16)
[ 1766.321255] amdgpu 0000:01:00.0: amdgpu: GPU fault detected: 147 0x0772e002
[ 1766.321256] amdgpu 0000:01:00.0: amdgpu: for process plasmashell pid 2961 thread plasmashel:cs0 pid 3007
[ 1766.321257] amdgpu 0000:01:00.0: amdgpu: VM_CONTEXT1_PROTECTION_FAULT_ADDR 0x00101200
[ 1766.321258] amdgpu 0000:01:00.0: amdgpu: VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x0B0A0002
[ 1766.321258] amdgpu 0000:01:00.0: amdgpu: VM fault (0x02, vmid 5, pasid 32772) at page 1053184, write from 'CB4' (0x43423400) (160)
[ 1766.321262] amdgpu 0000:01:00.0: amdgpu: GPU fault detected: 147 0x0772d002
[ 1766.321263] amdgpu 0000:01:00.0: amdgpu: for process plasmashell pid 2961 thread plasmashel:cs0 pid 3007
[ 1766.321264] amdgpu 0000:01:00.0: amdgpu: VM_CONTEXT1_PROTECTION_FAULT_ADDR 0x00101232
[ 1766.321264] amdgpu 0000:01:00.0: amdgpu: VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x0B0A0002
[ 1766.321265] amdgpu 0000:01:00.0: amdgpu: VM fault (0x02, vmid 5, pasid 32772) at page 1053234, write from 'CB4' (0x43423400) (160)
[ 1766.321268] amdgpu 0000:01:00.0: amdgpu: GPU fault detected: 147 0x07729002
[ 1766.321269] amdgpu 0000:01:00.0: amdgpu: for process plasmashell pid 2961 thread plasmashel:cs0 pid 3007
[ 1766.321271] amdgpu 0000:01:00.0: amdgpu: VM_CONTEXT1_PROTECTION_FAULT_ADDR 0x0010123A
[ 1766.321271] amdgpu 0000:01:00.0: amdgpu: VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x0B050002
[ 1766.321272] amdgpu 0000:01:00.0: amdgpu: VM fault (0x02, vmid 5, pasid 32772) at page 1053242, write from 'CB1' (0x43423100) (80)
[ 1766.321275] amdgpu 0000:01:00.0: amdgpu: GPU fault detected: 147 0x0732d002
[ 1766.321276] amdgpu 0000:01:00.0: amdgpu: for process plasmashell pid 2961 thread plasmashel:cs0 pid 3007
[ 1766.321277] amdgpu 0000:01:00.0: amdgpu: VM_CONTEXT1_PROTECTION_FAULT_ADDR 0x001012AB
[ 1766.321278] amdgpu 0000:01:00.0: amdgpu: VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x0B020002
[ 1766.321279] amdgpu 0000:01:00.0: amdgpu: VM fault (0x02, vmid 5, pasid 32772) at page 1053355, write from 'CB2' (0x43423200) (32)
[ 1766.321282] amdgpu 0000:01:00.0: amdgpu: GPU fault detected: 147 0x07126002
[ 1766.321283] amdgpu 0000:01:00.0: amdgpu: for process plasmashell pid 2961 thread plasmashel:cs0 pid 3007
[ 1766.321284] amdgpu 0000:01:00.0: amdgpu: VM_CONTEXT1_PROTECTION_FAULT_ADDR 0x0010124C
[ 1766.321285] amdgpu 0000:01:00.0: amdgpu: VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x0B0E0002
[ 1766.321286] amdgpu 0000:01:00.0: amdgpu: VM fault (0x02, vmid 5, pasid 32772) at page 1053260, write from 'CB6' (0x43423600) (224)
[ 1766.321289] amdgpu 0000:01:00.0: amdgpu: GPU fault detected: 147 0x07b21002
[ 1766.321290] amdgpu 0000:01:00.0: amdgpu: for process plasmashell pid 2961 thread plasmashel:cs0 pid 3007
[ 1766.321291] amdgpu 0000:01:00.0: amdgpu: VM_CONTEXT1_PROTECTION_FAULT_ADDR 0x00101223
[ 1766.321292] amdgpu 0000:01:00.0: amdgpu: VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x0B050002
[ 1766.321293] amdgpu 0000:01:00.0: amdgpu: VM fault (0x02, vmid 5, pasid 32772) at page 1053219, write from 'CB1' (0x43423100) (80)
[ 1766.321296] amdgpu 0000:01:00.0: amdgpu: GPU fault detected: 147 0x0732a002
[ 1766.321297] amdgpu 0000:01:00.0: amdgpu: for process plasmashell pid 2961 thread plasmashel:cs0 pid 3007
[ 1766.321298] amdgpu 0000:01:00.0: amdgpu: VM_CONTEXT1_PROTECTION_FAULT_ADDR 0x00101277
[ 1766.321298] amdgpu 0000:01:00.0: amdgpu: VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x0B0D0002
[ 1766.321299] amdgpu 0000:01:00.0: amdgpu: VM fault (0x02, vmid 5, pasid 32772) at page 1053303, write from 'CB7' (0x43423700) (208)
[ 1777.234990] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx timeout, signaled seq=168813, emitted seq=168816
[ 1777.236251] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* Process information: process plasmashell pid 2961 thread plasmashel:cs0 pid 3007
Jul 25 22:09:10 HOSTNAME kernel: amdgpu 0000:01:00.0: amdgpu: VM fault (0x02, vmid 5, pasid 32772) at page 1053219, write from 'CB1' (0x43423100) (80)
Jul 25 22:09:10 HOSTNAME kernel: amdgpu 0000:01:00.0: amdgpu: GPU fault detected: 147 0x0732a002
Jul 25 22:09:10 HOSTNAME kernel: amdgpu 0000:01:00.0: amdgpu: for process plasmashell pid 2961 thread plasmashel:cs0 pid 3007
Jul 25 22:09:10 HOSTNAME kernel: amdgpu 0000:01:00.0: amdgpu: VM_CONTEXT1_PROTECTION_FAULT_ADDR 0x00101277
Jul 25 22:09:10 HOSTNAME kernel: amdgpu 0000:01:00.0: amdgpu: VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x0B0D0002
Jul 25 22:09:10 HOSTNAME kernel: amdgpu 0000:01:00.0: amdgpu: VM fault (0x02, vmid 5, pasid 32772) at page 1053303, write from 'CB7' (0x43423700) (208)
Jul 25 22:09:21 HOSTNAME kernel: [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx timeout, signaled seq=168813, emitted seq=168816
Jul 25 22:09:21 HOSTNAME kernel: [drm:amdgpu_job_timedout [amdgpu]] *ERROR* Process information: process plasmashell pid 2961 thread plasmashel:cs0 pid 3007
Jul 25 22:09:21 HOSTNAME kernel: amdgpu 0000:01:00.0: amdgpu: GPU reset begin!
Jul 25 22:09:21 HOSTNAME kernel: amdgpu: cp is busy, skip halt cp
Jul 25 22:09:22 HOSTNAME kernel: amdgpu: rlc is busy, skip halt rlc
Jul 25 22:09:22 HOSTNAME kernel: amdgpu 0000:01:00.0: amdgpu: BACO reset
Jul 25 22:09:22 HOSTNAME kernel: amdgpu 0000:01:00.0: amdgpu: GPU reset succeeded, trying to resume
Jul 25 22:09:22 HOSTNAME kernel: [drm] PCIE GART of 1024M enabled (table at 0x000000F400800000).
Jul 25 22:09:22 HOSTNAME kernel: [drm] VRAM is lost due to GPU reset!
Jul 25 22:09:22 HOSTNAME kernel: amdgpu 0000:01:00.0: [drm:amdgpu_ring_test_helper [amdgpu]] *ERROR* ring comp_1.2.0 test failed (-110)
Jul 25 22:09:22 HOSTNAME kernel: [drm] UVD initialized successfully.
Jul 25 22:09:22 HOSTNAME kernel: [drm] VCE initialized successfully.
Jul 25 22:09:22 HOSTNAME kernel: amdgpu 0000:01:00.0: amdgpu: recover vram bo from shadow start
Jul 25 22:09:22 HOSTNAME mpv[5307]: amdgpu: The CS has cancelled because the context is lost. This context is innocent.
Jul 25 22:09:22 HOSTNAME kernel: amdgpu 0000:01:00.0: amdgpu: recover vram bo from shadow done
Jul 25 22:09:22 HOSTNAME kernel: amdgpu 0000:01:00.0: amdgpu: GPU reset(2) succeeded!
Jul 25 22:09:22 HOSTNAME systemd-coredump[5681]: Process 5307 (mpv) of user 1000 terminated abnormally with signal 6/ABRT, processing...
Jul 25 22:09:22 HOSTNAME systemd[1]: Created slice Slice /system/drkonqi-coredump-processor.
-- Subject: A start job for unit system-drkonqi\x2dcoredump\x2dprocessor.slice has finished successfully