[opt-rocm/rocm] "HIP error: invalid device function" when printing GPU tensors
Description:
Printing GPU tensors produced by some functions, such as torch.ones, raises an "invalid device function" exception, while printing a tensor created with torch.tensor works. The official PyTorch wheel installed with pip from their own repos works without issues, so in light of #28 (closed) I would guess this is caused by ROCm packaging as well, but I have no idea which component. I have tried compiling PyTorch with device-side assertions enabled and running with kernel serialization, but neither gave any more detailed information. I have also tried building PyTorch from the official Git repo without makepkg, which produces the very same error.
I'm not sure how to debug this, as it manifests as a Python exception rather than a segfault. I managed to set a breakpoint on the function that produces the error message text (c10::hip::get_hip_check_suffix, source code here), but I don't seem to have enough debug symbols right now to dig further.
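For clarity, the contrast described above is roughly the following (a minimal sketch; the concrete tensor values are just examples):
import torch
# Reportedly works: printing a tensor created with torch.tensor
a = torch.tensor([1.0], device='cuda')
print(a)
# Fails: printing a tensor created with torch.ones raises
# "torch.AcceleratorError: HIP error: invalid device function"
b = torch.ones(1, device='cuda')
print(b)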
Additional info:
- package version(s): 2.8.0-3 for both -rocm and -opt-rocm variants
- hardware: RX 6900 XT (target name gfx1030)
- config and/or log files:
Exception:
Traceback (most recent call last):
File "<python-input-2>", line 1, in <module>
print(torch.ones(1, device='cuda'))
~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/lib/python3.13/site-packages/torch/_tensor.py", line 590, in __repr__
return torch._tensor_str._str(self, tensor_contents=tensor_contents)
~~~~~~~~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/lib/python3.13/site-packages/torch/_tensor_str.py", line 726, in _str
return _str_intern(self, tensor_contents=tensor_contents)
File "/usr/lib/python3.13/site-packages/torch/_tensor_str.py", line 647, in _str_intern
tensor_str = _tensor_str(self, indent)
File "/usr/lib/python3.13/site-packages/torch/_tensor_str.py", line 379, in _tensor_str
formatter = _Formatter(get_summarized_data(self) if summarize else self)
File "/usr/lib/python3.13/site-packages/torch/_tensor_str.py", line 154, in __init__
nonzero_finite_vals = torch.masked_select(
tensor_view, torch.isfinite(tensor_view) & tensor_view.ne(0)
)
torch.AcceleratorError: HIP error: invalid device function
HIP kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing AMD_SERIALIZE_KERNEL=3
Compile with `TORCH_USE_HIP_DSA` to enable device-side assertions.
Call stack for c10::hip::get_hip_check_suffix, if it helps:
#0 0x00007fffcdf09c34 in c10::hip::get_hip_check_suffix() () from /usr/lib/libc10_hip.so
#1 0x00007fffcdefec52 in c10::hip::c10_hip_check_implementation(int, char const*, char const*, int, bool) () from /usr/lib/libc10_hip.so
#2 0x00007fffcf4633c0 in void at::native::nonzero_cuda_out_impl<bool>(at::Tensor const&, at::Tensor&) () from /usr/lib/libtorch_hip.so
#3 0x00007fffcf43449b in at::native::nonzero_out_cuda(at::Tensor const&, at::Tensor&) () from /usr/lib/libtorch_hip.so
#4 0x00007fffcf43485c in at::native::nonzero_cuda(at::Tensor const&) () from /usr/lib/libtorch_hip.so
#5 0x00007fffd02945de in ?? () from /usr/lib/libtorch_hip.so
#6 0x00007fffd02946b1 in ?? () from /usr/lib/libtorch_hip.so
#7 0x00007fffe06e2b7d in at::_ops::nonzero::call(at::Tensor const&) () from /usr/lib/libtorch_cpu.so
#8 0x00007fffe00e7c7f in ?? () from /usr/lib/libtorch_cpu.so
#9 0x00007fffe00dac68 in at::meta::structured_index_Tensor::meta(at::Tensor const&, c10::IListRef<at::OptionalTensorRef>) () from /usr/lib/libtorch_cpu.so
#10 0x00007fffd03d9362 in ?? () from /usr/lib/libtorch_hip.so
#11 0x00007fffd01f1877 in ?? () from /usr/lib/libtorch_hip.so
#12 0x00007fffd01f2057 in at::native::masked_select_cuda(at::Tensor const&, at::Tensor const&) () from /usr/lib/libtorch_hip.so
#13 0x00007fffd02a0a43 in ?? () from /usr/lib/libtorch_hip.so
#14 0x00007fffd02a0b14 in ?? () from /usr/lib/libtorch_hip.so
#15 0x00007fffe0a0ae2a in at::_ops::masked_select::redispatch(c10::DispatchKeySet, at::Tensor const&, at::Tensor const&) () from /usr/lib/libtorch_cpu.so
#16 0x00007fffe3694e33 in ?? () from /usr/lib/libtorch_cpu.so
#17 0x00007fffe3695267 in ?? () from /usr/lib/libtorch_cpu.so
#18 0x00007fffe0a895bf in at::_ops::masked_select::call(at::Tensor const&, at::Tensor const&) () from /usr/lib/libtorch_cpu.so
#19 0x00007fffed8cc42f in ?? () from /usr/lib/python3.13/site-packages/torch/lib/libtorch_python.so
#20 0x00007ffff79960cc in cfunction_call (func=0x7ffff6739080, args=0x7ffff69f7200, kwargs=0x0) at Objects/methodobject.c:539
#21 0x00007ffff79620cb in _PyObject_MakeTpCall (tstate=0x7ffff7d32df0 <_PyRuntime+283024>, callable=0x7ffff6739080, args=0x7ffff7f6aab0, nargs=<optimized out>, keywords=<optimized out>) at Objects/call.c:242
#22 0x00007ffff797697a in _PyEval_EvalFrameDefault (tstate=<optimized out>, frame=<optimized out>, throwflag=<optimized out>) at Python/generated_cases.c.h:813
#23 0x00007ffff79aa53a in _PyEval_EvalFrame (tstate=0x7ffff7d32df0 <_PyRuntime+283024>, frame=0x7ffff7f6aa10, throwflag=0) at ./Include/internal/pycore_ceval.h:119
#24 _PyEval_Vector (tstate=0x7ffff7d32df0 <_PyRuntime+283024>, func=0x7ffde15a1bc0, locals=0x0, args=0x7fffffffc060, argcount=2, kwnames=0x0) at Python/ceval.c:1816
#25 _PyFunction_Vectorcall (func=0x7ffde15a1bc0, stack=0x7fffffffc060, nargsf=<optimized out>, kwnames=0x0) at Objects/call.c:413
#26 _PyObject_VectorcallDictTstate (tstate=0x7ffff7d32df0 <_PyRuntime+283024>, callable=0x7ffde15a1bc0, args=0x7fffffffc060, nargsf=<optimized out>, kwargs=<optimized out>) at Objects/call.c:135
#27 _PyObject_Call_Prepend (tstate=0x7ffff7d32df0 <_PyRuntime+283024>, callable=0x7ffde15a1bc0, obj=<optimized out>, args=<optimized out>, kwargs=<optimized out>) at Objects/call.c:504
#28 slot_tp_init (self=<optimized out>, args=<optimized out>, kwds=<optimized out>) at Objects/typeobject.c:9816
#29 0x00007ffff796202d in type_call (self=0x55555b789af0, args=0x7ffff69af0a0, kwds=0x0) at Objects/typeobject.c:1997
#30 _PyObject_MakeTpCall (tstate=0x7ffff7d32df0 <_PyRuntime+283024>, callable=0x55555b789af0, args=<optimized out>, nargs=<optimized out>, keywords=<optimized out>) at Objects/call.c:242
#31 0x00007ffff797697a in _PyEval_EvalFrameDefault (tstate=<optimized out>, frame=<optimized out>, throwflag=<optimized out>) at Python/generated_cases.c.h:813
#32 0x00007ffff79a7c96 in _PyObject_VectorcallTstate (tstate=0x7ffff7d32df0 <_PyRuntime+283024>, callable=0x7ffde1f83d80, args=0x7fffffffc308, nargsf=9223372036854775809, kwnames=0x0) at ./Include/internal/pycore_call.h:168
#33 PyObject_CallOneArg (func=0x7ffde1f83d80, arg=<optimized out>) at Objects/call.c:395
#34 0x00007ffff7a9f550 in slot_tp_repr (self=0x7ffff699dcc0) at Objects/typeobject.c:9499
#35 0x00007ffff79ba3c7 in object_str (self=0x7ffff699dcc0) at Objects/typeobject.c:6250
#36 PyObject_Str (v=0x7ffff699dcc0) at Objects/object.c:814
#37 0x00007ffff7a8dbab in PyFile_WriteObject (v=0x7ffff699dcc0, f=<optimized out>, flags=<optimized out>) at Objects/fileobject.c:117
#38 0x00007ffff7a8d12d in builtin_print_impl (module=<optimized out>, args=0x7ffff69af130, sep=0x0, end=<optimized out>, file=0x7ffff7716500, flush=0) at Python/bltinmodule.c:2123
#39 builtin_print (module=<optimized out>, args=<optimized out>, nargs=<optimized out>, kwnames=<optimized out>) at Python/clinic/bltinmodule.c.h:981
#40 0x00007ffff798d976 in cfunction_vectorcall_FASTCALL_KEYWORDS (func=<optimized out>, args=0x7ffff7f6a668, nargsf=<optimized out>, kwnames=0x0) at Objects/methodobject.c:440
#41 0x00007ffff79645bd in _PyObject_VectorcallTstate (tstate=0x7ffff7d32df0 <_PyRuntime+283024>, callable=0x7ffff779e430, args=0x7ffff7f6a668, nargsf=9223372036854775809, kwnames=0x0) at ./Include/internal/pycore_call.h:168
#42 PyObject_Vectorcall (callable=0x7ffff779e430, args=0x7ffff7f6a668, nargsf=9223372036854775809, kwnames=0x0) at Objects/call.c:327
#43 0x00007ffff797697a in _PyEval_EvalFrameDefault (tstate=<optimized out>, frame=<optimized out>, throwflag=<optimized out>) at Python/generated_cases.c.h:813
#44 0x00007ffff7a4e2c9 in PyEval_EvalCode (co=0x7ffff69fc330, globals=<optimized out>, locals=0x7ffff6d34e80) at Python/ceval.c:604
#45 0x00007ffff7a69083 in builtin_exec_impl (module=<optimized out>, source=0x7ffff69fc330, globals=0x7ffff6d34e80, locals=0x7ffff6d34e80, closure=<optimized out>) at Python/bltinmodule.c:1143
#46 builtin_exec (module=<optimized out>, args=<optimized out>, nargs=<optimized out>, kwnames=<optimized out>) at Python/clinic/bltinmodule.c.h:556
#47 0x00007ffff797631e in _PyEval_EvalFrameDefault (tstate=<optimized out>, frame=<optimized out>, throwflag=<optimized out>) at Python/generated_cases.c.h:1217
#48 0x00007ffff7a4e2c9 in PyEval_EvalCode (co=0x7ffff6bc5f30, globals=<optimized out>, locals=0x7ffff6d34e80) at Python/ceval.c:604
#49 0x00007ffff7a69083 in builtin_exec_impl (module=<optimized out>, source=0x7ffff6bc5f30, globals=0x7ffff6d34e80, locals=0x7ffff6d34e80, closure=<optimized out>) at Python/bltinmodule.c:1143
#50 builtin_exec (module=<optimized out>, args=<optimized out>, nargs=<optimized out>, kwnames=<optimized out>) at Python/clinic/bltinmodule.c.h:556
#51 0x00007ffff798d976 in cfunction_vectorcall_FASTCALL_KEYWORDS (func=<optimized out>, args=0x7ffff7f6a180, nargsf=<optimized out>, kwnames=0x0) at Objects/methodobject.c:440
#52 0x00007ffff79645bd in _PyObject_VectorcallTstate (tstate=0x7ffff7d32df0 <_PyRuntime+283024>, callable=0x7ffff779dd50, args=0x7ffff7f6a180, nargsf=9223372036854775810, kwnames=0x0) at ./Include/internal/pycore_call.h:168
#53 PyObject_Vectorcall (callable=0x7ffff779dd50, args=0x7ffff7f6a180, nargsf=9223372036854775810, kwnames=0x0) at Objects/call.c:327
#54 0x00007ffff797697a in _PyEval_EvalFrameDefault (tstate=<optimized out>, frame=<optimized out>, throwflag=<optimized out>) at Python/generated_cases.c.h:813
#55 0x00007ffff7a847b0 in pymain_run_module (modname=modname@entry=0x7ffff7c01150 L"_pyrepl", set_argv0=set_argv0@entry=0) at Modules/main.c:349
#56 0x00007ffff7892634 in pymain_run_stdin (config=0x7ffff7d054e8 <_PyRuntime+96392>) at Modules/main.c:575
#57 pymain_run_python (exitcode=0x7fffffffce1c) at Modules/main.c:699
#58 Py_RunMain () at Modules/main.c:775
#59 0x00007ffff7a3bbeb in Py_BytesMain (argc=<optimized out>, argv=<optimized out>) at Modules/main.c:829
#60 0x00007ffff7427675 in __libc_start_call_main (main=main@entry=0x555555555120 <main>, argc=argc@entry=1, argv=argv@entry=0x7fffffffd088) at ../sysdeps/nptl/libc_start_call_main.h:58
#61 0x00007ffff7427729 in __libc_start_main_impl (main=0x555555555120 <main>, argc=1, argv=0x7fffffffd088, init=<optimized out>, fini=<optimized out>, rtld_fini=<optimized out>, stack_end=0x7fffffffd078) at ../csu/libc-start.c:360
#62 0x0000555555555045 in _start ()
Steps to reproduce:
- Run the following Python code:
import torch
print(torch.ones(1, device='cuda'))
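If it helps with triage, a small diagnostic sketch like the one below could narrow down whether the shipped kernels cover this card's architecture (get_arch_list, version.hip and get_device_name are standard torch APIs; nothing else is assumed):
import torch
# Versions of the installed build
print(torch.__version__)     # e.g. 2.8.0
print(torch.version.hip)     # HIP version the package was built against
# Offload architectures the bundled kernels were compiled for;
# "invalid device function" usually means the running GPU's target
# (gfx1030 here) is missing from this list
print(torch.cuda.get_arch_list())
# The device PyTorch actually sees
print(torch.cuda.is_available())
print(torch.cuda.get_device_name(0))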