[opt-rocm/rocm] Matrix multiplication causes segfault

Description:

Matrix multiplication on the GPU causes a segfault. This is very likely a packaging bug, because PyTorch 2.8 installed in a venv with the following command does not exhibit the behaviour:

pip3 install torch torchvision --index-url https://download.pytorch.org/whl/rocm6.4

Additional info:

package version(s): python-pytorch-opt-rocm 2.8.0-2 (also affects python-pytorch-rocm)
hardware: AMD Radeon RX 7900 XTX
config and/or log files:

[opdesktop:40146:0:40146] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x28)
==== backtrace (tid:  40146) ====
 0 0x000000000004dd71 ucs_rcache_distribution_get_num_bins()  ???:0
 1 0x000000000004df3d ucs_rcache_distribution_get_num_bins()  ???:0
 2 0x000000000003e540 __sigaction()  ???:0
 3 0x00000000003c4640 hiprtcLinkDestroy()  ???:0
 4 0x00000000003f7bc2 hiprtcLinkDestroy()  ???:0
 5 0x00000000000bbc12 hipGetDeviceProperties()  ???:0
 6 0x00000000000bc909 hipGetDeviceProperties()  ???:0
 7 0x00000000000571b2 hipGetCmdName()  ???:0
 8 0x0000000000057356 hipGetCmdName()  ???:0
 9 0x00000000002d0ac6 __gnu_f2h_ieee()  ???:0
10 0x000000000028beaf hipRegisterTracerCallback()  ???:0
11 0x000000000139c64a rocblas_zswap_strided_batched_64()  ???:0
12 0x000000000139ddb4 rocblas_zswap_strided_batched_64()  ???:0
13 0x000000000094eb6b rocblas_initialize()  ???:0
14 0x000000000095f36a rocblas_internal_tensile_is_initialized()  ???:0
15 0x000000000054ff01 rocblas_internal_gemm_template<float>()  ???:0
16 0x000000000053440b rocblas_sgemm()  ???:0
17 0x00000000000d98ab hipblasSgemm()  ???:0
18 0x0000000001f0f065 at::cuda::blas::getrfBatched<c10::complex<float> >()  ???:0
19 0x0000000001ff1895 at::native::_bmm_dtype_cuda()  ???:0
20 0x0000000001ff364b at::native::structured_mm_out_cuda::impl()  ???:0
21 0x0000000002354c40 at::cuda::bmm()  ???:0
22 0x0000000002354d04 at::cuda::mm()  ???:0
23 0x00000000025d662a at::_ops::mm::redispatch()  ???:0
24 0x00000000051deca9 std::vector<c10::IValue, std::allocator<c10::IValue> >::_M_realloc_append<std::optional<bool>&>()  ???:0
25 0x00000000051df307 std::vector<c10::IValue, std::allocator<c10::IValue> >::_M_realloc_append<std::optional<bool>&>()  ???:0
26 0x0000000002639ecf at::_ops::mm::call()  ???:0
27 0x00000000018c1ecb at::native::linalg_pinv_out()  ???:0
28 0x00000000018c2574 at::native::matmul()  ???:0
29 0x0000000002c81ad5 at::compositeimplicitautogradnestedtensor::reshape()  ???:0
30 0x000000000278fb6f at::_ops::matmul::call()  ???:0
31 0x0000000000463476 std::vector<bool, std::allocator<bool> >::_M_fill_insert()  ???:0
32 0x000000000046352e std::vector<bool, std::allocator<bool> >::_M_fill_insert()  ???:0
33 0x000000000019c3e3 _PyObject_GetMethod()  ???:0
34 0x000000000020e50a PyType_GetModule()  ???:0
35 0x00000000000ea954 PyNumber_InPlacePower()  ???:0
36 0x000000000019f448 PyNumber_Add()  ???:0
37 0x00000000002ee25b PyNumber_MatrixMultiply()  ???:0
38 0x00000000001739a2 _PyEval_EvalFrameDefault()  ???:0
39 0x000000000024e2c9 PyEval_EvalCode()  ???:0
40 0x0000000000269083 PyMem_RawCalloc()  ???:0
41 0x000000000017631e _PyEval_EvalFrameDefault()  ???:0
42 0x000000000024e2c9 PyEval_EvalCode()  ???:0
43 0x0000000000269083 PyMem_RawCalloc()  ???:0
44 0x000000000018d976 _PyFunction_SetVersion()  ???:0
45 0x00000000001645bd PyObject_Vectorcall()  ???:0
46 0x000000000017697a _PyEval_EvalFrameDefault()  ???:0
47 0x00000000002847b0 PyUnicode_AsUTF8String()  ???:0
48 0x0000000000092634 ???()  /usr/lib/libpython3.13.so.1.0:0
49 0x000000000023bbeb Py_BytesMain()  ???:0
50 0x0000000000027675 __libc_init_first()  ???:0
51 0x0000000000027729 __libc_start_main()  ???:0
52 0x0000000000001045 _start()  ???:0
=================================
[1]    40146 segmentation fault (core dumped)  python

Steps to reproduce:

Run the following Python script:

import torch
x = torch.randn(1000, 1000, device="cuda")
y = torch.randn(1000, 1000, device="cuda")
z = x @ y
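
To make the comparison between the packaged build and the venv build easier, something like the following could be run once with each interpreter. It is only a sketch: the helper names describe_environment and run_repro are made up for illustration and are not part of PyTorch. It prints the version details of the torch build and then runs the script above in a child process, so that a crash is reported as an exit code instead of killing the calling shell session.

```python
import platform
import subprocess
import sys

# The reproduction script from this report, unchanged. It is run in a child
# process so that a segfault only kills the child.
REPRO = """
import torch
x = torch.randn(1000, 1000, device="cuda")
y = torch.randn(1000, 1000, device="cuda")
z = x @ y
"""


def describe_environment():
    """Collect version details for comparing two installs (hypothetical helper)."""
    info = {"python": platform.python_version(), "executable": sys.executable}
    try:
        import torch
    except ImportError:
        info["torch"] = None  # torch is not installed for this interpreter
        return info
    info["torch"] = torch.__version__
    info["hip"] = torch.version.hip  # None on builds without ROCm support
    info["gpu_available"] = torch.cuda.is_available()
    if info["gpu_available"]:
        info["device"] = torch.cuda.get_device_name(0)
    return info


def run_repro():
    """Run the repro with the current interpreter and return its exit code."""
    return subprocess.run([sys.executable, "-c", REPRO]).returncode


if __name__ == "__main__":
    for key, value in describe_environment().items():
        print(f"{key}: {value}")
    # On POSIX a negative exit code -N means the child was killed by signal N,
    # so a segfault (signal 11) is reported as -11.
    print("repro exit code:", run_repro())
```

Running it with /usr/bin/python and again with the venv's python should show whether the two builds report different torch or HIP versions, and which of them crashes.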
