
Half-precision floating-point broken in Tensile when compiled with rocm-llvm 6.2.0

Description:

Running ollama or llama.cpp on ROCm 6.2 fails with the same CUBLAS_STATUS_INTERNAL_ERROR:

```
rocBLAS error: Tensile solution found, but exception thrown for { a_type: "f16_r", b_type: "f16_r", c_type: "f16_r", d_type: "f16_r", compute_type: "f16_r", transA: 'T', transB: 'N', M: 128, N: 2, K: 32, alpha: 1, row_stride_a: 1, col_stride_a: 4096, row_stride_b: 1, col_stride_b: 32, row_stride_c: 1, col_stride_c: 128, row_stride_d: 1, col_stride_d: 128, beta: 0, batch_count: 40, strided_batch: true, stride_a: 524288, stride_b: 64, stride_c: 256, stride_d: 256, atomics_mode: atomics_allowed }
Alpha value 7.21875 doesn't match that set in problem: 1
ggml/src/ggml-cuda.cu:70: ROCm errorROCm error: CUBLAS_STATUS_INTERNAL_ERROR
  current device: 0, in function ggml_cuda_mul_mat_batched_cublas at ggml/src/ggml-cuda.cu:1839
  hipblasGemmStridedBatchedEx(ctx.cublas_handle(), HIPBLAS_OP_T, HIPBLAS_OP_N, ne01, ne11, ne10, alpha, (const char *) src0_f16, HIPBLAS_R_16F, nb01/nb00, nb02/nb00, (const char *) src1_f16, HIPBLAS_R_16F, nb11/nb10, nb12/nb10, beta, ( char *) dst_t, cu_data_type, ne01, nb2/nb0, ne12*ne13, cu_compute_type, HIPBLAS_GEMM_DEFAULT)

ptrace: Operation not permitted.
No stack.
The program is not being run.
Aborted (core dumped)
```
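To take llama.cpp out of the loop, the failing call can be replayed in isolation. The following is an untested, hypothetical sketch, not a confirmed reproducer: the GEMM shape, leading dimensions, strides, and f16 types are copied from the rocBLAS trace above, the HIP_CHECK/HIPBLAS_CHECK macros are ad-hoc helpers, and the buffers are just zero-filled device memory.

```cpp
// Hypothetical standalone reproducer (untested sketch): replays the failing
// hipblasGemmStridedBatchedEx call with the parameters from the trace above.
#include <hip/hip_runtime.h>
#include <hip/hip_fp16.h>
#include <hipblas/hipblas.h>
#include <cstdio>
#include <cstdlib>

#define HIP_CHECK(x) do { hipError_t e_ = (x); if (e_ != hipSuccess) { \
    fprintf(stderr, "HIP error %d at line %d\n", (int)e_, __LINE__); exit(1); } } while (0)
#define HIPBLAS_CHECK(x) do { hipblasStatus_t s_ = (x); if (s_ != HIPBLAS_STATUS_SUCCESS) { \
    fprintf(stderr, "hipBLAS error %d at line %d\n", (int)s_, __LINE__); exit(1); } } while (0)

int main() {
    // Shape from the trace: M=128, N=2, K=32, 40 strided batches, all f16.
    const int M = 128, N = 2, K = 32, batch = 40;
    const int lda = 4096, ldb = 32, ldc = 128;
    const hipblasStride sA = 524288, sB = 64, sC = 256;

    __half *dA, *dB, *dC;
    HIP_CHECK(hipMalloc(&dA, sizeof(__half) * sA * batch));
    HIP_CHECK(hipMalloc(&dB, sizeof(__half) * sB * batch));
    HIP_CHECK(hipMalloc(&dC, sizeof(__half) * sC * batch));
    HIP_CHECK(hipMemset(dA, 0, sizeof(__half) * sA * batch));
    HIP_CHECK(hipMemset(dB, 0, sizeof(__half) * sB * batch));

    // compute_type is f16_r, so alpha/beta are passed as f16; the bug shows up
    // as the kernel seeing a garbage alpha (7.21875 instead of 1).
    __half alpha = __float2half(1.0f);
    __half beta  = __float2half(0.0f);

    hipblasHandle_t handle;
    HIPBLAS_CHECK(hipblasCreate(&handle));
    HIPBLAS_CHECK(hipblasGemmStridedBatchedEx(handle,
        HIPBLAS_OP_T, HIPBLAS_OP_N, M, N, K,
        &alpha,
        dA, HIPBLAS_R_16F, lda, sA,
        dB, HIPBLAS_R_16F, ldb, sB,
        &beta,
        dC, HIPBLAS_R_16F, ldc, sC,
        batch, HIPBLAS_R_16F, HIPBLAS_GEMM_DEFAULT));
    HIP_CHECK(hipDeviceSynchronize());
    printf("GEMM completed without error\n");

    HIPBLAS_CHECK(hipblasDestroy(handle));
    HIP_CHECK(hipFree(dA)); HIP_CHECK(hipFree(dB)); HIP_CHECK(hipFree(dC));
    return 0;
}
```

Built with something like `hipcc repro.cpp -lhipblas`, this would be expected to throw the same "Alpha value ... doesn't match" exception on an affected ROCm 6.2 install, and to print the success line on a working stack.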

When run with AMD_LOG_LEVEL=4, llama.cpp produces multiple hipErrorNotFound errors, whereas ollama produces only a hipErrorNotReady error (see logs).

The error is independent of the model being run.

Tested on gfx1030 / RX 6900 XT.
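As a sanity check that the runtime is really targeting gfx1030, the selected device and gfx name can be printed with a short HIP program (a minimal sketch; it assumes only the gcnArchName field that ROCm's hipDeviceProp_t provides):

```cpp
// Print each HIP device's name and gfx target to confirm gfx1030 is in use.
#include <hip/hip_runtime.h>
#include <cstdio>

int main() {
    int count = 0;
    if (hipGetDeviceCount(&count) != hipSuccess || count == 0) {
        fprintf(stderr, "no HIP devices found\n");
        return 1;
    }
    for (int i = 0; i < count; ++i) {
        hipDeviceProp_t prop;
        if (hipGetDeviceProperties(&prop, i) == hipSuccess)
            printf("device %d: %s (%s)\n", i, prop.name, prop.gcnArchName);
    }
    return 0;
}
```

`rocminfo` reports the same gfx target.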

Additional info:

package version(s)

extra/rocblas 6.2.2-1
extra/ollama-rocm 0.3.12-5

config and/or log files:

ollama_serve_amd_log_level_3_llama3.log

llama_cli_amd_log_level_4_llama2.log

link to upstream bug report, if any:

ollama-rocm#3 (closed)

https://github.com/ollama/ollama/issues/7564

https://github.com/ggerganov/llama.cpp/issues/10234

https://github.com/ollama/ollama/issues/6857 (Gentoo issue with the same error)

Steps to reproduce:

The easiest way to reproduce is to run ollama from the extra repository.

  1. pacman -S ollama
  2. ollama serve
  3. In a different shell: ollama run llama3.2