Skip to content

Conversation

@arsenm
Copy link
Contributor

@arsenm arsenm commented Nov 5, 2025

Allow widening up to 128-bit registers or if the new register class
is at least as large as one of the existing register classes.

This was artificially limiting. In particular this was doing the wrong
thing with sequences involving copies between VGPRs and AV registers.
Nearly all test changes are improvements.

The coalescer does not just widen registers out of nowhere. If it's trying
to "widen" a register, it's generally packing a register into an existing
register tuple, or in a situation where the constraints imply the wider
class anyway. 067a110 addressed the allocation failure concern by
rejecting coalescing if there are no available registers. The original
change in a4e63ea didn't include a realistic testcase to judge if
this is harmful for pressure. I would expect any issues from this to
be of garden variety subreg handling issue. We could use more dynamic
state information here if it really is an issue.

I get the best results by removing this override completely. This is
a smaller step for patch splitting purposes.

Copy link
Contributor Author

arsenm commented Nov 5, 2025

@llvmbot
Copy link
Member

llvmbot commented Nov 5, 2025

@llvm/pr-subscribers-llvm-globalisel
@llvm/pr-subscribers-llvm-regalloc

@llvm/pr-subscribers-backend-amdgpu

Author: Matt Arsenault (arsenm)

Changes

Allow widening up to 128-bit registers or if the new register class
is at least as large as one of the existing register classes.

This was artificially limiting. In particular this was doing the wrong
thing with sequences involving copies between VGPRs and AV registers.
Nearly all test changes are improvements.

The coalescer does not just widen registers out of nowhere. If it's trying
to "widen" a register, it's generally packing a register into an existing
register tuple, or in a situation where the constraints imply the wider
class anyway. 067a110 addressed the allocation failure concern by
rejecting coalescing if there are no available registers. The original
change in a4e63ea didn't include a realistic testcase to judge if
this is harmful for pressure. I would expect any issues from this to
be of garden variety subreg handling issue. We could use more dynamic
state information here if it really is an issue.

I get the best results by removing this override completely. This is
a smaller step for patch splitting purposes.


Patch is 1.21 MiB, truncated to 20.00 KiB below, full version: https://github.com/llvm/llvm-project/pull/166475.diff

48 Files Affected:

  • (modified) llvm/lib/Target/AMDGPU/SIRegisterInfo.cpp (+4-11)
  • (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/mul-known-bits.i64.ll (+74-90)
  • (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/mul.ll (+390-418)
  • (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/sdiv.i64.ll (+1129-1156)
  • (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/sdivrem.ll (+496-501)
  • (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/srem.i64.ll (+1211-1245)
  • (modified) llvm/test/CodeGen/AMDGPU/GlobalISel/udivrem.ll (+349-353)
  • (modified) llvm/test/CodeGen/AMDGPU/atomic_optimizations_global_pointer.ll (+2-3)
  • (modified) llvm/test/CodeGen/AMDGPU/atomic_optimizations_local_pointer.ll (+3-4)
  • (modified) llvm/test/CodeGen/AMDGPU/buffer-fat-pointers-contents-legalization.ll (+3-3)
  • (modified) llvm/test/CodeGen/AMDGPU/bypass-div.ll (+82-84)
  • (modified) llvm/test/CodeGen/AMDGPU/dagcomb-mullohi.ll (+4-6)
  • (modified) llvm/test/CodeGen/AMDGPU/div-rem-by-constant-64.ll (+28-32)
  • (modified) llvm/test/CodeGen/AMDGPU/div_v2i128.ll (+118-124)
  • (modified) llvm/test/CodeGen/AMDGPU/fptoi.i128.ll (+310-314)
  • (modified) llvm/test/CodeGen/AMDGPU/hazard-recognizer-src-shared-base.ll (+3-3)
  • (modified) llvm/test/CodeGen/AMDGPU/identical-subrange-spill-infloop.ll (+6-6)
  • (modified) llvm/test/CodeGen/AMDGPU/integer-mad-patterns.ll (+298-333)
  • (modified) llvm/test/CodeGen/AMDGPU/lds-misaligned-bug.ll (+95-102)
  • (modified) llvm/test/CodeGen/AMDGPU/limit-coalesce.mir (+40-16)
  • (modified) llvm/test/CodeGen/AMDGPU/llvm.round.f64.ll (+9-11)
  • (modified) llvm/test/CodeGen/AMDGPU/load-global-i16.ll (+15-18)
  • (modified) llvm/test/CodeGen/AMDGPU/load-local-i16.ll (+56-59)
  • (modified) llvm/test/CodeGen/AMDGPU/mad_64_32.ll (+33-39)
  • (modified) llvm/test/CodeGen/AMDGPU/mul.ll (+83-86)
  • (modified) llvm/test/CodeGen/AMDGPU/rem_i128.ll (+32-34)
  • (modified) llvm/test/CodeGen/AMDGPU/shufflevector.v2f32.v2f32.ll (+28-36)
  • (modified) llvm/test/CodeGen/AMDGPU/shufflevector.v2f32.v3f32.ll (+278-242)
  • (modified) llvm/test/CodeGen/AMDGPU/shufflevector.v2i32.v2i32.ll (+28-36)
  • (modified) llvm/test/CodeGen/AMDGPU/shufflevector.v2i32.v3i32.ll (+278-242)
  • (modified) llvm/test/CodeGen/AMDGPU/shufflevector.v2p3.v2p3.ll (+28-36)
  • (modified) llvm/test/CodeGen/AMDGPU/shufflevector.v2p3.v3p3.ll (+278-242)
  • (modified) llvm/test/CodeGen/AMDGPU/shufflevector.v3f32.v2f32.ll (+310-578)
  • (modified) llvm/test/CodeGen/AMDGPU/shufflevector.v3f32.v3f32.ll (+190-230)
  • (modified) llvm/test/CodeGen/AMDGPU/shufflevector.v3i32.v2i32.ll (+310-578)
  • (modified) llvm/test/CodeGen/AMDGPU/shufflevector.v3i32.v3i32.ll (+190-230)
  • (modified) llvm/test/CodeGen/AMDGPU/shufflevector.v3p3.v2p3.ll (+310-578)
  • (modified) llvm/test/CodeGen/AMDGPU/shufflevector.v3p3.v3p3.ll (+190-230)
  • (modified) llvm/test/CodeGen/AMDGPU/shufflevector.v4bf16.v3bf16.ll (+56-68)
  • (modified) llvm/test/CodeGen/AMDGPU/shufflevector.v4bf16.v4bf16.ll (+39-48)
  • (modified) llvm/test/CodeGen/AMDGPU/shufflevector.v4f16.v3f16.ll (+56-68)
  • (modified) llvm/test/CodeGen/AMDGPU/shufflevector.v4f16.v4f16.ll (+39-48)
  • (modified) llvm/test/CodeGen/AMDGPU/shufflevector.v4f32.v3f32.ll (+40-50)
  • (modified) llvm/test/CodeGen/AMDGPU/shufflevector.v4i16.v3i16.ll (+56-68)
  • (modified) llvm/test/CodeGen/AMDGPU/shufflevector.v4i16.v4i16.ll (+39-48)
  • (modified) llvm/test/CodeGen/AMDGPU/shufflevector.v4i32.v3i32.ll (+40-50)
  • (modified) llvm/test/CodeGen/AMDGPU/shufflevector.v4p3.v3p3.ll (+40-50)
  • (modified) llvm/test/CodeGen/AMDGPU/vector-reduce-mul.ll (+311-386)
diff --git a/llvm/lib/Target/AMDGPU/SIRegisterInfo.cpp b/llvm/lib/Target/AMDGPU/SIRegisterInfo.cpp
index a6c1af24e13e9..152f0f85c9978 100644
--- a/llvm/lib/Target/AMDGPU/SIRegisterInfo.cpp
+++ b/llvm/lib/Target/AMDGPU/SIRegisterInfo.cpp
@@ -3741,18 +3741,11 @@ bool SIRegisterInfo::shouldCoalesce(MachineInstr *MI,
                                     unsigned DstSubReg,
                                     const TargetRegisterClass *NewRC,
                                     LiveIntervals &LIS) const {
-  unsigned SrcSize = getRegSizeInBits(*SrcRC);
-  unsigned DstSize = getRegSizeInBits(*DstRC);
+  // TODO: This should be more aggressive, but be more cautious with very wide
+  // tuples.
   unsigned NewSize = getRegSizeInBits(*NewRC);
-
-  // Do not increase size of registers beyond dword, we would need to allocate
-  // adjacent registers and constraint regalloc more than needed.
-
-  // Always allow dword coalescing.
-  if (SrcSize <= 32 || DstSize <= 32)
-    return true;
-
-  return NewSize <= DstSize || NewSize <= SrcSize;
+  return NewSize <= 128 || NewSize <= getRegSizeInBits(*SrcRC) ||
+         NewSize <= getRegSizeInBits(*DstRC);
 }
 
 unsigned SIRegisterInfo::getRegPressureLimit(const TargetRegisterClass *RC,
diff --git a/llvm/test/CodeGen/AMDGPU/GlobalISel/mul-known-bits.i64.ll b/llvm/test/CodeGen/AMDGPU/GlobalISel/mul-known-bits.i64.ll
index 1cd9c0bfeb7e6..85c6035e3d840 100644
--- a/llvm/test/CodeGen/AMDGPU/GlobalISel/mul-known-bits.i64.ll
+++ b/llvm/test/CodeGen/AMDGPU/GlobalISel/mul-known-bits.i64.ll
@@ -8,17 +8,16 @@ define amdgpu_kernel void @v_mul_i64_no_zext(ptr addrspace(1) %out, ptr addrspac
 ; GFX10-LABEL: v_mul_i64_no_zext:
 ; GFX10:       ; %bb.0:
 ; GFX10-NEXT:    s_load_dwordx4 s[0:3], s[4:5], 0x2c
-; GFX10-NEXT:    v_lshlrev_b32_e32 v7, 3, v0
+; GFX10-NEXT:    v_lshlrev_b32_e32 v6, 3, v0
 ; GFX10-NEXT:    s_waitcnt lgkmcnt(0)
 ; GFX10-NEXT:    s_clause 0x1
-; GFX10-NEXT:    global_load_dwordx2 v[0:1], v7, s[0:1]
-; GFX10-NEXT:    global_load_dwordx2 v[2:3], v7, s[2:3]
+; GFX10-NEXT:    global_load_dwordx2 v[2:3], v6, s[0:1]
+; GFX10-NEXT:    global_load_dwordx2 v[4:5], v6, s[2:3]
 ; GFX10-NEXT:    s_waitcnt vmcnt(0)
-; GFX10-NEXT:    v_mad_u64_u32 v[4:5], s0, v0, v2, 0
-; GFX10-NEXT:    v_mad_u64_u32 v[5:6], s0, v0, v3, v[5:6]
-; GFX10-NEXT:    v_mad_u64_u32 v[0:1], s0, v1, v2, v[5:6]
-; GFX10-NEXT:    v_mov_b32_e32 v5, v0
-; GFX10-NEXT:    global_store_dwordx2 v7, v[4:5], s[2:3]
+; GFX10-NEXT:    v_mad_u64_u32 v[0:1], s0, v2, v4, 0
+; GFX10-NEXT:    v_mad_u64_u32 v[1:2], s0, v2, v5, v[1:2]
+; GFX10-NEXT:    v_mad_u64_u32 v[1:2], s0, v3, v4, v[1:2]
+; GFX10-NEXT:    global_store_dwordx2 v6, v[0:1], s[2:3]
 ; GFX10-NEXT:    s_endpgm
 ;
 ; GFX11-LABEL: v_mul_i64_no_zext:
@@ -26,19 +25,17 @@ define amdgpu_kernel void @v_mul_i64_no_zext(ptr addrspace(1) %out, ptr addrspac
 ; GFX11-NEXT:    s_load_b128 s[0:3], s[4:5], 0x2c
 ; GFX11-NEXT:    v_and_b32_e32 v0, 0x3ff, v0
 ; GFX11-NEXT:    s_delay_alu instid0(VALU_DEP_1)
-; GFX11-NEXT:    v_lshlrev_b32_e32 v9, 3, v0
+; GFX11-NEXT:    v_lshlrev_b32_e32 v8, 3, v0
 ; GFX11-NEXT:    s_waitcnt lgkmcnt(0)
 ; GFX11-NEXT:    s_clause 0x1
-; GFX11-NEXT:    global_load_b64 v[0:1], v9, s[0:1]
-; GFX11-NEXT:    global_load_b64 v[2:3], v9, s[2:3]
+; GFX11-NEXT:    global_load_b64 v[2:3], v8, s[0:1]
+; GFX11-NEXT:    global_load_b64 v[4:5], v8, s[2:3]
 ; GFX11-NEXT:    s_waitcnt vmcnt(0)
-; GFX11-NEXT:    v_mad_u64_u32 v[4:5], null, v0, v2, 0
+; GFX11-NEXT:    v_mad_u64_u32 v[0:1], null, v2, v4, 0
 ; GFX11-NEXT:    s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
-; GFX11-NEXT:    v_mad_u64_u32 v[6:7], null, v0, v3, v[5:6]
-; GFX11-NEXT:    v_mad_u64_u32 v[7:8], null, v1, v2, v[6:7]
-; GFX11-NEXT:    s_delay_alu instid0(VALU_DEP_1)
-; GFX11-NEXT:    v_mov_b32_e32 v5, v7
-; GFX11-NEXT:    global_store_b64 v9, v[4:5], s[2:3]
+; GFX11-NEXT:    v_mad_u64_u32 v[6:7], null, v2, v5, v[1:2]
+; GFX11-NEXT:    v_mad_u64_u32 v[1:2], null, v3, v4, v[6:7]
+; GFX11-NEXT:    global_store_b64 v8, v[0:1], s[2:3]
 ; GFX11-NEXT:    s_endpgm
   %tid = call i32 @llvm.amdgcn.workitem.id.x()
   %gep.a = getelementptr inbounds i64, ptr addrspace(1) %aptr, i32 %tid
@@ -58,18 +55,16 @@ define amdgpu_kernel void @v_mul_i64_zext_src1(ptr addrspace(1) %out, ptr addrsp
 ; GFX10-NEXT:    s_clause 0x1
 ; GFX10-NEXT:    s_load_dwordx4 s[0:3], s[4:5], 0x24
 ; GFX10-NEXT:    s_load_dwordx2 s[6:7], s[4:5], 0x34
-; GFX10-NEXT:    v_lshlrev_b32_e32 v2, 3, v0
-; GFX10-NEXT:    v_lshlrev_b32_e32 v3, 2, v0
+; GFX10-NEXT:    v_lshlrev_b32_e32 v3, 3, v0
+; GFX10-NEXT:    v_lshlrev_b32_e32 v0, 2, v0
 ; GFX10-NEXT:    s_waitcnt lgkmcnt(0)
-; GFX10-NEXT:    global_load_dwordx2 v[0:1], v2, s[2:3]
-; GFX10-NEXT:    global_load_dword v4, v3, s[6:7]
+; GFX10-NEXT:    global_load_dwordx2 v[1:2], v3, s[2:3]
+; GFX10-NEXT:    global_load_dword v4, v0, s[6:7]
 ; GFX10-NEXT:    s_waitcnt vmcnt(0)
-; GFX10-NEXT:    v_mad_u64_u32 v[2:3], s2, v0, v4, 0
-; GFX10-NEXT:    v_mov_b32_e32 v0, v3
-; GFX10-NEXT:    v_mad_u64_u32 v[0:1], s2, v1, v4, v[0:1]
-; GFX10-NEXT:    v_mov_b32_e32 v3, v0
-; GFX10-NEXT:    v_mov_b32_e32 v0, 0
-; GFX10-NEXT:    global_store_dwordx2 v0, v[2:3], s[0:1]
+; GFX10-NEXT:    v_mad_u64_u32 v[0:1], s2, v1, v4, 0
+; GFX10-NEXT:    v_mad_u64_u32 v[1:2], s2, v2, v4, v[1:2]
+; GFX10-NEXT:    v_mov_b32_e32 v2, 0
+; GFX10-NEXT:    global_store_dwordx2 v2, v[0:1], s[0:1]
 ; GFX10-NEXT:    s_endpgm
 ;
 ; GFX11-LABEL: v_mul_i64_zext_src1:
@@ -80,17 +75,16 @@ define amdgpu_kernel void @v_mul_i64_zext_src1(ptr addrspace(1) %out, ptr addrsp
 ; GFX11-NEXT:    v_and_b32_e32 v0, 0x3ff, v0
 ; GFX11-NEXT:    s_delay_alu instid0(VALU_DEP_1)
 ; GFX11-NEXT:    v_lshlrev_b32_e32 v1, 3, v0
-; GFX11-NEXT:    v_lshlrev_b32_e32 v2, 2, v0
+; GFX11-NEXT:    v_lshlrev_b32_e32 v0, 2, v0
 ; GFX11-NEXT:    s_waitcnt lgkmcnt(0)
-; GFX11-NEXT:    global_load_b64 v[0:1], v1, s[2:3]
-; GFX11-NEXT:    global_load_b32 v5, v2, s[4:5]
+; GFX11-NEXT:    global_load_b64 v[1:2], v1, s[2:3]
+; GFX11-NEXT:    global_load_b32 v5, v0, s[4:5]
 ; GFX11-NEXT:    s_waitcnt vmcnt(0)
-; GFX11-NEXT:    v_mad_u64_u32 v[2:3], null, v0, v5, 0
+; GFX11-NEXT:    v_mad_u64_u32 v[0:1], null, v1, v5, 0
 ; GFX11-NEXT:    s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
-; GFX11-NEXT:    v_mov_b32_e32 v0, v3
-; GFX11-NEXT:    v_mad_u64_u32 v[3:4], null, v1, v5, v[0:1]
-; GFX11-NEXT:    v_mov_b32_e32 v0, 0
-; GFX11-NEXT:    global_store_b64 v0, v[2:3], s[0:1]
+; GFX11-NEXT:    v_mad_u64_u32 v[3:4], null, v2, v5, v[1:2]
+; GFX11-NEXT:    v_dual_mov_b32 v2, 0 :: v_dual_mov_b32 v1, v3
+; GFX11-NEXT:    global_store_b64 v2, v[0:1], s[0:1]
 ; GFX11-NEXT:    s_endpgm
   %tid = call i32 @llvm.amdgcn.workitem.id.x()
   %gep.a = getelementptr inbounds i64, ptr addrspace(1) %aptr, i32 %tid
@@ -110,18 +104,16 @@ define amdgpu_kernel void @v_mul_i64_zext_src0(ptr addrspace(1) %out, ptr addrsp
 ; GFX10-NEXT:    s_clause 0x1
 ; GFX10-NEXT:    s_load_dwordx4 s[0:3], s[4:5], 0x24
 ; GFX10-NEXT:    s_load_dwordx2 s[6:7], s[4:5], 0x34
-; GFX10-NEXT:    v_lshlrev_b32_e32 v2, 2, v0
-; GFX10-NEXT:    v_lshlrev_b32_e32 v3, 3, v0
+; GFX10-NEXT:    v_lshlrev_b32_e32 v3, 2, v0
+; GFX10-NEXT:    v_lshlrev_b32_e32 v0, 3, v0
 ; GFX10-NEXT:    s_waitcnt lgkmcnt(0)
-; GFX10-NEXT:    global_load_dword v4, v2, s[2:3]
-; GFX10-NEXT:    global_load_dwordx2 v[0:1], v3, s[6:7]
+; GFX10-NEXT:    global_load_dword v4, v3, s[2:3]
+; GFX10-NEXT:    global_load_dwordx2 v[1:2], v0, s[6:7]
 ; GFX10-NEXT:    s_waitcnt vmcnt(0)
-; GFX10-NEXT:    v_mad_u64_u32 v[2:3], s2, v4, v0, 0
-; GFX10-NEXT:    v_mov_b32_e32 v0, v3
-; GFX10-NEXT:    v_mad_u64_u32 v[0:1], s2, v4, v1, v[0:1]
-; GFX10-NEXT:    v_mov_b32_e32 v3, v0
-; GFX10-NEXT:    v_mov_b32_e32 v0, 0
-; GFX10-NEXT:    global_store_dwordx2 v0, v[2:3], s[0:1]
+; GFX10-NEXT:    v_mad_u64_u32 v[0:1], s2, v4, v1, 0
+; GFX10-NEXT:    v_mad_u64_u32 v[1:2], s2, v4, v2, v[1:2]
+; GFX10-NEXT:    v_mov_b32_e32 v2, 0
+; GFX10-NEXT:    global_store_dwordx2 v2, v[0:1], s[0:1]
 ; GFX10-NEXT:    s_endpgm
 ;
 ; GFX11-LABEL: v_mul_i64_zext_src0:
@@ -135,14 +127,13 @@ define amdgpu_kernel void @v_mul_i64_zext_src0(ptr addrspace(1) %out, ptr addrsp
 ; GFX11-NEXT:    v_lshlrev_b32_e32 v0, 3, v0
 ; GFX11-NEXT:    s_waitcnt lgkmcnt(0)
 ; GFX11-NEXT:    global_load_b32 v5, v1, s[2:3]
-; GFX11-NEXT:    global_load_b64 v[0:1], v0, s[4:5]
+; GFX11-NEXT:    global_load_b64 v[1:2], v0, s[4:5]
 ; GFX11-NEXT:    s_waitcnt vmcnt(0)
-; GFX11-NEXT:    v_mad_u64_u32 v[2:3], null, v5, v0, 0
+; GFX11-NEXT:    v_mad_u64_u32 v[0:1], null, v5, v1, 0
 ; GFX11-NEXT:    s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
-; GFX11-NEXT:    v_mov_b32_e32 v0, v3
-; GFX11-NEXT:    v_mad_u64_u32 v[3:4], null, v5, v1, v[0:1]
-; GFX11-NEXT:    v_mov_b32_e32 v0, 0
-; GFX11-NEXT:    global_store_b64 v0, v[2:3], s[0:1]
+; GFX11-NEXT:    v_mad_u64_u32 v[3:4], null, v5, v2, v[1:2]
+; GFX11-NEXT:    v_dual_mov_b32 v2, 0 :: v_dual_mov_b32 v1, v3
+; GFX11-NEXT:    global_store_b64 v2, v[0:1], s[0:1]
 ; GFX11-NEXT:    s_endpgm
   %tid = call i32 @llvm.amdgcn.workitem.id.x()
   %gep.a = getelementptr inbounds i32, ptr addrspace(1) %aptr, i32 %tid
@@ -209,18 +200,16 @@ define amdgpu_kernel void @v_mul_i64_masked_src0_hi(ptr addrspace(1) %out, ptr a
 ; GFX10-NEXT:    s_clause 0x1
 ; GFX10-NEXT:    s_load_dwordx4 s[0:3], s[4:5], 0x24
 ; GFX10-NEXT:    s_load_dwordx2 s[6:7], s[4:5], 0x34
-; GFX10-NEXT:    v_lshlrev_b32_e32 v2, 3, v0
+; GFX10-NEXT:    v_lshlrev_b32_e32 v0, 3, v0
 ; GFX10-NEXT:    s_waitcnt lgkmcnt(0)
 ; GFX10-NEXT:    s_clause 0x1
-; GFX10-NEXT:    global_load_dword v4, v2, s[2:3]
-; GFX10-NEXT:    global_load_dwordx2 v[0:1], v2, s[6:7]
+; GFX10-NEXT:    global_load_dword v3, v0, s[2:3]
+; GFX10-NEXT:    global_load_dwordx2 v[1:2], v0, s[6:7]
 ; GFX10-NEXT:    s_waitcnt vmcnt(0)
-; GFX10-NEXT:    v_mad_u64_u32 v[2:3], s2, v4, v0, 0
-; GFX10-NEXT:    v_mov_b32_e32 v0, v3
-; GFX10-NEXT:    v_mad_u64_u32 v[0:1], s2, v4, v1, v[0:1]
-; GFX10-NEXT:    v_mov_b32_e32 v3, v0
-; GFX10-NEXT:    v_mov_b32_e32 v0, 0
-; GFX10-NEXT:    global_store_dwordx2 v0, v[2:3], s[0:1]
+; GFX10-NEXT:    v_mad_u64_u32 v[0:1], s2, v3, v1, 0
+; GFX10-NEXT:    v_mad_u64_u32 v[1:2], s2, v3, v2, v[1:2]
+; GFX10-NEXT:    v_mov_b32_e32 v2, 0
+; GFX10-NEXT:    global_store_dwordx2 v2, v[0:1], s[0:1]
 ; GFX10-NEXT:    s_endpgm
 ;
 ; GFX11-LABEL: v_mul_i64_masked_src0_hi:
@@ -234,14 +223,13 @@ define amdgpu_kernel void @v_mul_i64_masked_src0_hi(ptr addrspace(1) %out, ptr a
 ; GFX11-NEXT:    s_waitcnt lgkmcnt(0)
 ; GFX11-NEXT:    s_clause 0x1
 ; GFX11-NEXT:    global_load_b32 v5, v0, s[2:3]
-; GFX11-NEXT:    global_load_b64 v[0:1], v0, s[4:5]
+; GFX11-NEXT:    global_load_b64 v[1:2], v0, s[4:5]
 ; GFX11-NEXT:    s_waitcnt vmcnt(0)
-; GFX11-NEXT:    v_mad_u64_u32 v[2:3], null, v5, v0, 0
+; GFX11-NEXT:    v_mad_u64_u32 v[0:1], null, v5, v1, 0
 ; GFX11-NEXT:    s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
-; GFX11-NEXT:    v_mov_b32_e32 v0, v3
-; GFX11-NEXT:    v_mad_u64_u32 v[3:4], null, v5, v1, v[0:1]
-; GFX11-NEXT:    v_mov_b32_e32 v0, 0
-; GFX11-NEXT:    global_store_b64 v0, v[2:3], s[0:1]
+; GFX11-NEXT:    v_mad_u64_u32 v[3:4], null, v5, v2, v[1:2]
+; GFX11-NEXT:    v_dual_mov_b32 v2, 0 :: v_dual_mov_b32 v1, v3
+; GFX11-NEXT:    global_store_b64 v2, v[0:1], s[0:1]
 ; GFX11-NEXT:    s_endpgm
  %tid = call i32 @llvm.amdgcn.workitem.id.x()
   %gep.a = getelementptr inbounds i64, ptr addrspace(1) %aptr, i32 %tid
@@ -389,22 +377,20 @@ define amdgpu_kernel void @v_mul_i64_partially_masked_src0(ptr addrspace(1) %out
 ; GFX10-NEXT:    s_clause 0x1
 ; GFX10-NEXT:    s_load_dwordx4 s[0:3], s[4:5], 0x24
 ; GFX10-NEXT:    s_load_dwordx2 s[6:7], s[4:5], 0x34
-; GFX10-NEXT:    v_lshlrev_b32_e32 v4, 3, v0
+; GFX10-NEXT:    v_lshlrev_b32_e32 v0, 3, v0
 ; GFX10-NEXT:    s_waitcnt lgkmcnt(0)
 ; GFX10-NEXT:    s_clause 0x1
-; GFX10-NEXT:    global_load_dwordx2 v[0:1], v4, s[2:3]
-; GFX10-NEXT:    global_load_dwordx2 v[2:3], v4, s[6:7]
+; GFX10-NEXT:    global_load_dwordx2 v[1:2], v0, s[2:3]
+; GFX10-NEXT:    global_load_dwordx2 v[3:4], v0, s[6:7]
 ; GFX10-NEXT:    s_waitcnt vmcnt(1)
-; GFX10-NEXT:    v_and_b32_e32 v6, 0xfff00000, v0
+; GFX10-NEXT:    v_and_b32_e32 v5, 0xfff00000, v1
 ; GFX10-NEXT:    s_waitcnt vmcnt(0)
-; GFX10-NEXT:    v_mad_u64_u32 v[4:5], s2, v6, v2, 0
-; GFX10-NEXT:    v_mov_b32_e32 v0, v5
-; GFX10-NEXT:    v_mad_u64_u32 v[5:6], s2, v6, v3, v[0:1]
-; GFX10-NEXT:    v_and_b32_e32 v0, 0xf00f, v1
-; GFX10-NEXT:    v_mad_u64_u32 v[0:1], s2, v0, v2, v[5:6]
-; GFX10-NEXT:    v_mov_b32_e32 v5, v0
-; GFX10-NEXT:    v_mov_b32_e32 v0, 0
-; GFX10-NEXT:    global_store_dwordx2 v0, v[4:5], s[0:1]
+; GFX10-NEXT:    v_mad_u64_u32 v[0:1], s2, v5, v3, 0
+; GFX10-NEXT:    v_mad_u64_u32 v[4:5], s2, v5, v4, v[1:2]
+; GFX10-NEXT:    v_and_b32_e32 v1, 0xf00f, v2
+; GFX10-NEXT:    v_mad_u64_u32 v[1:2], s2, v1, v3, v[4:5]
+; GFX10-NEXT:    v_mov_b32_e32 v2, 0
+; GFX10-NEXT:    global_store_dwordx2 v2, v[0:1], s[0:1]
 ; GFX10-NEXT:    s_endpgm
 ;
 ; GFX11-LABEL: v_mul_i64_partially_masked_src0:
@@ -414,24 +400,22 @@ define amdgpu_kernel void @v_mul_i64_partially_masked_src0(ptr addrspace(1) %out
 ; GFX11-NEXT:    s_load_b64 s[4:5], s[4:5], 0x34
 ; GFX11-NEXT:    v_and_b32_e32 v0, 0x3ff, v0
 ; GFX11-NEXT:    s_delay_alu instid0(VALU_DEP_1)
-; GFX11-NEXT:    v_lshlrev_b32_e32 v2, 3, v0
+; GFX11-NEXT:    v_lshlrev_b32_e32 v0, 3, v0
 ; GFX11-NEXT:    s_waitcnt lgkmcnt(0)
 ; GFX11-NEXT:    s_clause 0x1
-; GFX11-NEXT:    global_load_b64 v[0:1], v2, s[2:3]
-; GFX11-NEXT:    global_load_b64 v[2:3], v2, s[4:5]
+; GFX11-NEXT:    global_load_b64 v[1:2], v0, s[2:3]
+; GFX11-NEXT:    global_load_b64 v[3:4], v0, s[4:5]
 ; GFX11-NEXT:    s_waitcnt vmcnt(1)
-; GFX11-NEXT:    v_and_b32_e32 v7, 0xfff00000, v0
+; GFX11-NEXT:    v_and_b32_e32 v7, 0xfff00000, v1
 ; GFX11-NEXT:    s_waitcnt vmcnt(0)
 ; GFX11-NEXT:    s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
-; GFX11-NEXT:    v_mad_u64_u32 v[4:5], null, v7, v2, 0
-; GFX11-NEXT:    v_mov_b32_e32 v0, v5
-; GFX11-NEXT:    s_delay_alu instid0(VALU_DEP_1) | instskip(SKIP_1) | instid1(VALU_DEP_1)
-; GFX11-NEXT:    v_mad_u64_u32 v[5:6], null, v7, v3, v[0:1]
-; GFX11-NEXT:    v_and_b32_e32 v3, 0xf00f, v1
-; GFX11-NEXT:    v_mad_u64_u32 v[0:1], null, v3, v2, v[5:6]
+; GFX11-NEXT:    v_mad_u64_u32 v[0:1], null, v7, v3, 0
+; GFX11-NEXT:    v_mad_u64_u32 v[5:6], null, v7, v4, v[1:2]
+; GFX11-NEXT:    v_and_b32_e32 v4, 0xf00f, v2
 ; GFX11-NEXT:    s_delay_alu instid0(VALU_DEP_1)
-; GFX11-NEXT:    v_dual_mov_b32 v5, v0 :: v_dual_mov_b32 v0, 0
-; GFX11-NEXT:    global_store_b64 v0, v[4:5], s[0:1]
+; GFX11-NEXT:    v_mad_u64_u32 v[1:2], null, v4, v3, v[5:6]
+; GFX11-NEXT:    v_mov_b32_e32 v2, 0
+; GFX11-NEXT:    global_store_b64 v2, v[0:1], s[0:1]
 ; GFX11-NEXT:    s_endpgm
   %tid = call i32 @llvm.amdgcn.workitem.id.x()
   %gep.a = getelementptr inbounds i64, ptr addrspace(1) %aptr, i32 %tid
diff --git a/llvm/test/CodeGen/AMDGPU/GlobalISel/mul.ll b/llvm/test/CodeGen/AMDGPU/GlobalISel/mul.ll
index 637aaf7529364..81a8cdc12359d 100644
--- a/llvm/test/CodeGen/AMDGPU/GlobalISel/mul.ll
+++ b/llvm/test/CodeGen/AMDGPU/GlobalISel/mul.ll
@@ -740,12 +740,12 @@ define i96 @v_mul_i96(i96 %num, i96 %den) {
 ; GCN-NEXT:    v_mov_b32_e32 v6, v0
 ; GCN-NEXT:    v_mov_b32_e32 v7, v1
 ; GCN-NEXT:    v_mad_u64_u32 v[0:1], s[4:5], v6, v5, 0
-; GCN-NEXT:    v_mad_u64_u32 v[8:9], s[4:5], v7, v4, v[0:1]
-; GCN-NEXT:    v_mad_u64_u32 v[0:1], s[4:5], v6, v3, 0
-; GCN-NEXT:    v_mad_u64_u32 v[8:9], s[4:5], v2, v3, v[8:9]
-; GCN-NEXT:    v_mov_b32_e32 v2, v8
+; GCN-NEXT:    v_mov_b32_e32 v8, v3
+; GCN-NEXT:    v_mad_u64_u32 v[9:10], s[4:5], v7, v4, v[0:1]
+; GCN-NEXT:    v_mad_u64_u32 v[0:1], s[4:5], v6, v8, 0
+; GCN-NEXT:    v_mad_u64_u32 v[2:3], s[4:5], v2, v8, v[9:10]
 ; GCN-NEXT:    v_mad_u64_u32 v[1:2], s[4:5], v6, v4, v[1:2]
-; GCN-NEXT:    v_mad_u64_u32 v[1:2], s[4:5], v7, v3, v[1:2]
+; GCN-NEXT:    v_mad_u64_u32 v[1:2], s[4:5], v7, v8, v[1:2]
 ; GCN-NEXT:    s_setpc_b64 s[30:31]
 ;
 ; GFX10-LABEL: v_mul_i96:
@@ -753,26 +753,26 @@ define i96 @v_mul_i96(i96 %num, i96 %den) {
 ; GFX10-NEXT:    s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
 ; GFX10-NEXT:    v_mov_b32_e32 v6, v0
 ; GFX10-NEXT:    v_mov_b32_e32 v7, v1
+; GFX10-NEXT:    v_mov_b32_e32 v8, v3
 ; GFX10-NEXT:    v_mul_lo_u32 v0, v6, v5
-; GFX10-NEXT:    v_mad_u64_u32 v[8:9], s4, v7, v4, v[0:1]
-; GFX10-NEXT:    v_mad_u64_u32 v[0:1], s4, v6, v3, 0
-; GFX10-NEXT:    v_mad_u64_u32 v[8:9], s4, v2, v3, v[8:9]
-; GFX10-NEXT:    v_mov_b32_e32 v2, v8
+; GFX10-NEXT:    v_mad_u64_u32 v[9:10], s4, v7, v4, v[0:1]
+; GFX10-NEXT:    v_mad_u64_u32 v[0:1], s4, v6, v8, 0
+; GFX10-NEXT:    v_mad_u64_u32 v[2:3], s4, v2, v8, v[9:10]
 ; GFX10-NEXT:    v_mad_u64_u32 v[1:2], s4, v6, v4, v[1:2]
-; GFX10-NEXT:    v_mad_u64_u32 v[1:2], s4, v7, v3, v[1:2]
+; GFX10-NEXT:    v_mad_u64_u32 v[1:2], s4, v7, v8, v[1:2]
 ; GFX10-NEXT:    s_setpc_b64 s[30:31]
 ;
 ; GFX11-LABEL: v_mul_i96:
 ; GFX11:       ; %bb.0:
 ; GFX11-NEXT:    s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
 ; GFX11-NEXT:    v_dual_mov_b32 v6, v0 :: v_dual_mov_b32 v7, v1
+; GFX11-NEXT:    v_dual_mov_b32 v8, v2 :: v_dual_mov_b32 v9, v3
 ; GFX11-NEXT:    v_mul_lo_u32 v0, v6, v5
-; GFX11-NEXT:    v_mad_u64_u32 v[8:9], null, v7, v4, v[0:1]
-; GFX11-NEXT:    v_mad_u64_u32 v[0:1], null, v6, v3, 0
-; GFX11-NEXT:    v_mad_u64_u32 v[9:10], null, v2, v3, v[8:9]
-; GFX11-NEXT:    v_mov_b32_e32 v2, v9
+; GFX11-NEXT:    v_mad_u64_u32 v[10:11], null, v7, v4, v[0:1]
+; GFX11-NEXT:    v_mad_u64_u32 v[0:1], null, v6, v9, 0
+; GFX11-NEXT:    v_mad_u64_u32 v[2:3], null, v8, v9, v[10:11]
 ; GFX11-NEXT:    v_mad_u64_u32 v[1:2], null, v6, v4, v[1:2]
-; GFX11-NEXT:    v_mad_u64_u32 v[1:2], null, v7, v3, v[1:2]
+; GFX11-NEXT:    v_mad_u64_u32 v[1:2], null, v7, v9, v[1:2]
 ; GFX11-NEXT:    s_setpc_b64 s[30:31]
 ;
 ; GFX12-LABEL: v_mul_i96:
@@ -783,16 +783,16 @@ define i96 @v_mul_i96(i96 %num, i96 %den) {
 ; GFX12-NEXT:    s_wait_bvhcnt 0x0
 ; GFX12-NEXT:    s_wait_kmcnt 0x0
 ; GFX12-NEXT:    v_dual_mov_b32 v6, v0 :: v_dual_mov_b32 v7, v1
-; GFX12-NEXT:    s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
-; GFX12-NEXT:    v_mul_lo_u32 v0, v6, v5
-; GFX12-NEXT:    v_mad_co_u64_u32 v[8:9], null, v7, v4, v[0:1]
-; GFX12-NEXT:    v_mad_co_u64_u32 v[0:1], null, v6, v3, 0
+; GFX12-NEXT:    v_mov_b32_e32 v8, v3
 ; GFX12-NEXT:    s_delay_alu instid0(VALU_DEP_2) | instskip(NEXT) | instid1(VALU_DEP_1)
-; GFX12-NEXT:    v_mad_co_u64_u32 v[8:9], null, v2, v3, v[8:9]
-; GFX12-NEXT:    v_mov_b32_e32 v2, v8
+; GFX12-NEXT:    v_mul_lo_u32 v0, v6, v5
+; GFX12-NEXT:    v_mad_co_u64_u32 v[9:10], null, v7, v4, v[0:1]
+; GFX12-NEXT:    s_delay_alu instid0(VALU_DEP_3) | instskip(NEXT) | instid1(VALU_DEP_2)
+; GFX12-NEXT:    v_mad_co_u64_u32 v[0:1], null, v6, v8, 0
+; GFX12-NEXT:    v_mad_co_u64_u32 v[2:3], null, v2, v8, v[9:10]
 ; GFX12-NEXT:    s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
 ; GFX12-NEXT:    v_mad_co_u64_u32 v[1:2], null, v6, v4, v[1:2]
-; GFX12-NEXT:    v_mad_co_u64_u32 v[1:2], null, v7, v3, v[1:2]
+; GFX12-NEXT:    v_mad_co_u64_u32 v[1:2], null, v7, v8, v[1:2]
 ; GFX12-NEXT:    s_setpc_b64 s[30:31]
 ;
 ; GFX1250-LABEL: v_mul_i96:
@@ -1071,18 +1071,18 @@ define i128 @v_mul_i128(i128 %num, i128 %den) {
 ; GFX7-NEXT:    v_mov_b32_e32 v9, v1
 ; GFX7-NEXT:    v_mad_u64_u32 v[0:1], s[4:5], v8, v6, 0
 ; GFX7-NEXT:    v_mov_b32_e32 v10, v2
-; GFX7-NEXT:    v_mul_lo_u32 v7, v8, v7
-; GFX7-NEXT:    v_mad_u64_u32 v[11:12], s[4:5], v9, v5, v[0:1]
+; GFX7-NEXT:    v_mov_b32_e32 v11, v3
+; GFX7-NEXT:    v_mad_u64_u32 v[2:3], s[4:5], v9, v5, v[0:1]
 ; GFX7-NEXT:    v_mad_u64_u32 v[0:1], s[4:5], v8, v4, 0
-; GFX7-NEXT:    v_mad_u64_u32 v[11:12], s[4:5], v10, v4, v[11:12]
+; GFX7-NEXT:    v_mad_u64_u32 v[2:3], s[4:5], v10, v4, v[2:3]
+; GFX7-NEXT:    v_mul_lo_u32 v7, v8, v7
 ; GFX7-NEXT:    v_mul_lo_u32 v6, v9, v6
-; GFX7-NEXT:    v_mov_b32_e32 v2, v11
 ; GFX7-NEXT:    v_mad_u64_u32 v[1:2], vcc, v8, v5, v[1:2]
 ; GFX7-NEXT:    v_mad_u64_u32 v[1:2], s[4:5], v9, v4, v[1:2]
-; GFX7-NEXT:    v_addc_u32_e64 v7, s[4:5], v12, v7, s[4:5]
-; GFX7-NEXT:    v_addc_u32_e32 v6, vcc, v7, v6, vcc
-; GFX7-NEXT:    v_mad_u64_u32 v[5:6], s[4:5], v10, v5, v[6:7]
-; GFX7-NEXT:    v_mad_u64_u32 v[3:4], s[4:5], v3, v4, v[5:6]
+; GFX7-NEXT:    v_addc_u32_e64 v3, s[4:5], v3, v7, s[4:5]
+; GFX7-NEXT:    v_addc_u32_e32 v3, vcc, v3, v6, vcc
+; GFX7-NEXT:    v_m...
[truncated]

; GFX12-NEXT: s_wait_kmcnt 0x0
; GFX12-NEXT: v_dual_mov_b32 v8, v0 :: v_dual_mov_b32 v9, v1
; GFX12-NEXT: v_mov_b32_e32 v10, v2
; GFX12-NEXT: v_dual_mov_b32 v10, v2 :: v_dual_mov_b32 v11, v3
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nice!

; CHECK-NEXT: flat_store_b64 v[0:1], v[2:3]
; CHECK-NEXT: v_dual_mov_b32 v1, 0 :: v_dual_mov_b32 v2, s1
; CHECK-NEXT: v_mov_b32_e32 v0, v1
; CHECK-NEXT: flat_store_b64 v[1:2], v[0:1]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

good, less instructions and registers used!

@arsenm arsenm force-pushed the users/arsenm/amdgpu/relax-shouldCoalesce branch from 6cab444 to 7af5764 Compare November 5, 2025 19:55
@arsenm
Copy link
Contributor Author

arsenm commented Nov 10, 2025

ping

Allow widening up to 128-bit registers or if the new register class
is at least as large as one of the existing register classes.

This was artificially limiting. In particular this was doing the wrong
thing with sequences involving copies between VGPRs and AV registers.
Nearly all test changes are improvements.

The coalescer does not just widen registers out of nowhere. If it's trying
to "widen" a register, it's generally packing a register into an existing
register tuple, or in a situation where the constraints imply the wider
class anyway. 067a110 addressed the allocation failure concern by
rejecting coalescing if there are no available registers. The original
change in a4e63ea didn't include a realistic testcase to judge if
this is harmful for pressure. I would expect any issues from this to
be of garden variety subreg handling issue. We could use more dynamic
state information here if it really is an issue.

I get the best results by removing this override completely. This is
a smaller step for patch splitting purposes.
@arsenm arsenm force-pushed the users/arsenm/amdgpu/relax-shouldCoalesce branch from 7af5764 to 9c1ccff Compare November 10, 2025 18:59
@qcolombet
Copy link
Collaborator

What's the rationale for the special case for 128-bit?

@arsenm
Copy link
Contributor Author

arsenm commented Nov 10, 2025

What's the rationale for the special case for 128-bit?

128-bit load/store is common, and it's still under the allocation granule

Copy link
Collaborator

@rampitec rampitec left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@arsenm arsenm merged commit bbde792 into main Nov 11, 2025
10 checks passed
@arsenm arsenm deleted the users/arsenm/amdgpu/relax-shouldCoalesce branch November 11, 2025 21:51
@llvm-ci
Copy link
Collaborator

llvm-ci commented Nov 11, 2025

LLVM Buildbot has detected a new failure on builder ml-opt-devrel-x86-64 running on ml-opt-devrel-x86-64-b1 while building llvm at step 6 "test-build-unified-tree-check-all".

Full details are available at: https://lab.llvm.org/buildbot/#/builders/175/builds/28647

Here is the relevant piece of the build log for the reference
Step 6 (test-build-unified-tree-check-all) failure: test (failure)
******************** TEST 'LLVM :: CodeGen/AMDGPU/amdgcn.bitcast.64bit.ll' FAILED ********************
Exit Code: 1

Command Output (stdout):
--
# RUN: at line 3
/b/ml-opt-devrel-x86-64-b1/build/bin/llc -mtriple=amdgcn -mcpu=tahiti < /b/ml-opt-devrel-x86-64-b1/llvm-project/llvm/test/CodeGen/AMDGPU/amdgcn.bitcast.64bit.ll | /b/ml-opt-devrel-x86-64-b1/build/bin/FileCheck -check-prefix=SI /b/ml-opt-devrel-x86-64-b1/llvm-project/llvm/test/CodeGen/AMDGPU/amdgcn.bitcast.64bit.ll
# executed command: /b/ml-opt-devrel-x86-64-b1/build/bin/llc -mtriple=amdgcn -mcpu=tahiti
# executed command: /b/ml-opt-devrel-x86-64-b1/build/bin/FileCheck -check-prefix=SI /b/ml-opt-devrel-x86-64-b1/llvm-project/llvm/test/CodeGen/AMDGPU/amdgcn.bitcast.64bit.ll
# .---command stderr------------
# | /b/ml-opt-devrel-x86-64-b1/llvm-project/llvm/test/CodeGen/AMDGPU/amdgcn.bitcast.64bit.ll:2275:12: error: SI-NEXT: expected string not found in input
# | ; SI-NEXT: v_mul_f32_e64 v7, 1.0, s17
# |            ^
# | <stdin>:1382:21: note: scanning from here
# |  s_cmp_lg_u32 s20, 0
# |                     ^
# | <stdin>:1383:2: note: possible intended match here
# |  v_mul_f32_e64 v8, 1.0, s17
# |  ^
# | /b/ml-opt-devrel-x86-64-b1/llvm-project/llvm/test/CodeGen/AMDGPU/amdgcn.bitcast.64bit.ll:5466:12: error: SI-NEXT: expected string not found in input
# | ; SI-NEXT: v_mul_f32_e64 v7, 1.0, s17
# |            ^
# | <stdin>:2951:21: note: scanning from here
# |  s_cmp_lg_u32 s20, 0
# |                     ^
# | <stdin>:2952:2: note: possible intended match here
# |  v_mul_f32_e64 v8, 1.0, s17
# |  ^
# | /b/ml-opt-devrel-x86-64-b1/llvm-project/llvm/test/CodeGen/AMDGPU/amdgcn.bitcast.64bit.ll:8370:12: error: SI-NEXT: expected string not found in input
# | ; SI-NEXT: v_mul_f32_e64 v7, 1.0, s17
# |            ^
# | <stdin>:4335:21: note: scanning from here
# |  s_cmp_lg_u32 s20, 0
# |                     ^
# | <stdin>:4336:2: note: possible intended match here
# |  v_mul_f32_e64 v8, 1.0, s17
# |  ^
# | /b/ml-opt-devrel-x86-64-b1/llvm-project/llvm/test/CodeGen/AMDGPU/amdgcn.bitcast.64bit.ll:10949:12: error: SI-NEXT: expected string not found in input
# | ; SI-NEXT: v_mul_f32_e64 v7, 1.0, s17
# |            ^
# | <stdin>:5525:21: note: scanning from here
# |  s_cmp_lg_u32 s20, 0
# |                     ^
# | <stdin>:5526:2: note: possible intended match here
# |  v_mul_f32_e64 v8, 1.0, s17
# |  ^
# | /b/ml-opt-devrel-x86-64-b1/llvm-project/llvm/test/CodeGen/AMDGPU/amdgcn.bitcast.64bit.ll:13165:12: error: SI-NEXT: expected string not found in input
# | ; SI-NEXT: v_mul_f32_e64 v6, 1.0, s17
# |            ^
# | <stdin>:6458:28: note: scanning from here
...

@llvm-ci
Copy link
Collaborator

llvm-ci commented Nov 11, 2025

LLVM Buildbot has detected a new failure on builder ml-opt-rel-x86-64 running on ml-opt-rel-x86-64-b2 while building llvm at step 6 "test-build-unified-tree-check-all".

Full details are available at: https://lab.llvm.org/buildbot/#/builders/185/builds/28668

Here is the relevant piece of the build log for the reference
Step 6 (test-build-unified-tree-check-all) failure: test (failure)
******************** TEST 'LLVM :: CodeGen/AMDGPU/amdgcn.bitcast.128bit.ll' FAILED ********************
Exit Code: 1

Command Output (stdout):
--
# RUN: at line 3
/b/ml-opt-rel-x86-64-b1/build/bin/llc -mtriple=amdgcn -mcpu=tahiti < /b/ml-opt-rel-x86-64-b1/llvm-project/llvm/test/CodeGen/AMDGPU/amdgcn.bitcast.128bit.ll | /b/ml-opt-rel-x86-64-b1/build/bin/FileCheck -check-prefix=SI /b/ml-opt-rel-x86-64-b1/llvm-project/llvm/test/CodeGen/AMDGPU/amdgcn.bitcast.128bit.ll
# executed command: /b/ml-opt-rel-x86-64-b1/build/bin/llc -mtriple=amdgcn -mcpu=tahiti
# executed command: /b/ml-opt-rel-x86-64-b1/build/bin/FileCheck -check-prefix=SI /b/ml-opt-rel-x86-64-b1/llvm-project/llvm/test/CodeGen/AMDGPU/amdgcn.bitcast.128bit.ll
# .---command stderr------------
# | /b/ml-opt-rel-x86-64-b1/llvm-project/llvm/test/CodeGen/AMDGPU/amdgcn.bitcast.128bit.ll:18738:12: error: SI-NEXT: expected string not found in input
# | ; SI-NEXT: v_mul_f32_e64 v14, 1.0, s17
# |            ^
# | <stdin>:8216:29: note: scanning from here
# |  v_mul_f32_e64 v15, 1.0, s16
# |                             ^
# | <stdin>:8217:2: note: possible intended match here
# |  v_mul_f32_e64 v1, 1.0, s17
# |  ^
# | /b/ml-opt-rel-x86-64-b1/llvm-project/llvm/test/CodeGen/AMDGPU/amdgcn.bitcast.128bit.ll:24558:12: error: SI-NEXT: expected string not found in input
# | ; SI-NEXT: v_mul_f32_e64 v23, 1.0, s16
# |            ^
# | <stdin>:10080:29: note: scanning from here
# |  v_mul_f32_e64 v28, 1.0, s17
# |                             ^
# | <stdin>:10081:2: note: possible intended match here
# |  v_mul_f32_e64 v8, 1.0, s16
# |  ^
# | 
# | Input file: <stdin>
# | Check file: /b/ml-opt-rel-x86-64-b1/llvm-project/llvm/test/CodeGen/AMDGPU/amdgcn.bitcast.128bit.ll
# | 
# | -dump-input=help explains the following input dump.
# | 
# | Input was:
# | <<<<<<
# |               .
# |               .
# |               .
# |            8211:  .type bitcast_v8bf16_to_v8i16_scalar,@function 
# |            8212: bitcast_v8bf16_to_v8i16_scalar: ; @bitcast_v8bf16_to_v8i16_scalar 
# |            8213: ; %bb.0: 
# |            8214:  s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0) 
# |            8215:  s_cmp_lg_u32 s24, 0 
# |            8216:  v_mul_f32_e64 v15, 1.0, s16 
# | next:18738'0                                 X error: no match found
# |            8217:  v_mul_f32_e64 v1, 1.0, s17 
# | next:18738'0     ~~~~~~~~~~~~~~~~~~~~~~~~~~~~
# | next:18738'1      ?                           possible intended match
# |            8218:  v_mul_f32_e64 v9, 1.0, s18 
...

@llvm-ci
Copy link
Collaborator

llvm-ci commented Nov 11, 2025

LLVM Buildbot has detected a new failure on builder ml-opt-dev-x86-64 running on ml-opt-dev-x86-64-b1 while building llvm at step 6 "test-build-unified-tree-check-all".

Full details are available at: https://lab.llvm.org/buildbot/#/builders/137/builds/28862

Here is the relevant piece of the build log for the reference
Step 6 (test-build-unified-tree-check-all) failure: test (failure)
******************** TEST 'LLVM :: CodeGen/AMDGPU/amdgcn.bitcast.64bit.ll' FAILED ********************
Exit Code: 1

Command Output (stdout):
--
# RUN: at line 3
/b/ml-opt-dev-x86-64-b1/build/bin/llc -mtriple=amdgcn -mcpu=tahiti < /b/ml-opt-dev-x86-64-b1/llvm-project/llvm/test/CodeGen/AMDGPU/amdgcn.bitcast.64bit.ll | /b/ml-opt-dev-x86-64-b1/build/bin/FileCheck -check-prefix=SI /b/ml-opt-dev-x86-64-b1/llvm-project/llvm/test/CodeGen/AMDGPU/amdgcn.bitcast.64bit.ll
# executed command: /b/ml-opt-dev-x86-64-b1/build/bin/llc -mtriple=amdgcn -mcpu=tahiti
# executed command: /b/ml-opt-dev-x86-64-b1/build/bin/FileCheck -check-prefix=SI /b/ml-opt-dev-x86-64-b1/llvm-project/llvm/test/CodeGen/AMDGPU/amdgcn.bitcast.64bit.ll
# .---command stderr------------
# | /b/ml-opt-dev-x86-64-b1/llvm-project/llvm/test/CodeGen/AMDGPU/amdgcn.bitcast.64bit.ll:2275:12: error: SI-NEXT: expected string not found in input
# | ; SI-NEXT: v_mul_f32_e64 v7, 1.0, s17
# |            ^
# | <stdin>:1382:21: note: scanning from here
# |  s_cmp_lg_u32 s20, 0
# |                     ^
# | <stdin>:1383:2: note: possible intended match here
# |  v_mul_f32_e64 v8, 1.0, s17
# |  ^
# | /b/ml-opt-dev-x86-64-b1/llvm-project/llvm/test/CodeGen/AMDGPU/amdgcn.bitcast.64bit.ll:5466:12: error: SI-NEXT: expected string not found in input
# | ; SI-NEXT: v_mul_f32_e64 v7, 1.0, s17
# |            ^
# | <stdin>:2951:21: note: scanning from here
# |  s_cmp_lg_u32 s20, 0
# |                     ^
# | <stdin>:2952:2: note: possible intended match here
# |  v_mul_f32_e64 v8, 1.0, s17
# |  ^
# | /b/ml-opt-dev-x86-64-b1/llvm-project/llvm/test/CodeGen/AMDGPU/amdgcn.bitcast.64bit.ll:8370:12: error: SI-NEXT: expected string not found in input
# | ; SI-NEXT: v_mul_f32_e64 v7, 1.0, s17
# |            ^
# | <stdin>:4335:21: note: scanning from here
# |  s_cmp_lg_u32 s20, 0
# |                     ^
# | <stdin>:4336:2: note: possible intended match here
# |  v_mul_f32_e64 v8, 1.0, s17
# |  ^
# | /b/ml-opt-dev-x86-64-b1/llvm-project/llvm/test/CodeGen/AMDGPU/amdgcn.bitcast.64bit.ll:10949:12: error: SI-NEXT: expected string not found in input
# | ; SI-NEXT: v_mul_f32_e64 v7, 1.0, s17
# |            ^
# | <stdin>:5525:21: note: scanning from here
# |  s_cmp_lg_u32 s20, 0
# |                     ^
# | <stdin>:5526:2: note: possible intended match here
# |  v_mul_f32_e64 v8, 1.0, s17
# |  ^
# | /b/ml-opt-dev-x86-64-b1/llvm-project/llvm/test/CodeGen/AMDGPU/amdgcn.bitcast.64bit.ll:13165:12: error: SI-NEXT: expected string not found in input
# | ; SI-NEXT: v_mul_f32_e64 v6, 1.0, s17
# |            ^
# | <stdin>:6458:28: note: scanning from here
...

@jplehr
Copy link
Contributor

jplehr commented Nov 11, 2025

Also impacted some of our bots, e.g.: https://lab.llvm.org/buildbot/#/builders/10

@arsenm
Copy link
Contributor Author

arsenm commented Nov 11, 2025

Fixed by #167590

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

8 participants