AMDGPU: Relax shouldCoalesce to allow more register tuple widening #166475
Conversation
@llvm/pr-subscribers-llvm-globalisel @llvm/pr-subscribers-backend-amdgpu

Author: Matt Arsenault (arsenm)

Changes

Allow widening up to 128-bit registers, or when the new register class is no larger than one of the existing register classes.

This was artificially limiting. In particular, it was doing the wrong thing with sequences involving copies between VGPRs and AV registers. Nearly all test changes are improvements.

The coalescer does not just widen registers out of nowhere. If it's trying to "widen" a register, it's generally packing a register into an existing register tuple, or in a situation where the constraints imply the wider class anyway. 067a110 addressed the allocation-failure concern by rejecting coalescing when there are no available registers. The original change in a4e63ea didn't include a realistic testcase to judge whether this is harmful for register pressure. I would expect any issues from this to be garden-variety subreg handling issues. We could use more dynamic state information here if it really is an issue.

I get the best results by removing this override completely. This is a smaller step, for patch-splitting purposes.

Patch is 1.21 MiB, truncated to 20.00 KiB below, full version: https://github.com/llvm/llvm-project/pull/166475.diff

48 Files Affected:
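Before the diff, a quick restatement of what the new rule permits — a minimal sketch of my own (the function name and worked cases are mine, not from the patch); the authoritative check is the `shouldCoalesce` change below.

```cpp
// Sketch only: mirrors the relaxed check in SIRegisterInfo::shouldCoalesce.
// Sizes are register-class widths in bits (what getRegSizeInBits returns).
#include <cassert>

static bool allowWidening(unsigned NewSize, unsigned SrcSize,
                          unsigned DstSize) {
  // Allow any result tuple up to 128 bits, or any result that is no wider
  // than one of the classes already being joined.
  return NewSize <= 128 || NewSize <= SrcSize || NewSize <= DstSize;
}

int main() {
  assert(allowWidening(64, 32, 64));    // packing a dword into a 64-bit pair
  assert(allowWidening(128, 64, 64));   // widening to 128 bits: newly allowed
  assert(allowWidening(256, 256, 32));  // no growth past an existing class
  assert(!allowWidening(256, 64, 128)); // widening past 128 bits: still rejected
}
```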
diff --git a/llvm/lib/Target/AMDGPU/SIRegisterInfo.cpp b/llvm/lib/Target/AMDGPU/SIRegisterInfo.cpp
index a6c1af24e13e9..152f0f85c9978 100644
--- a/llvm/lib/Target/AMDGPU/SIRegisterInfo.cpp
+++ b/llvm/lib/Target/AMDGPU/SIRegisterInfo.cpp
@@ -3741,18 +3741,11 @@ bool SIRegisterInfo::shouldCoalesce(MachineInstr *MI,
unsigned DstSubReg,
const TargetRegisterClass *NewRC,
LiveIntervals &LIS) const {
- unsigned SrcSize = getRegSizeInBits(*SrcRC);
- unsigned DstSize = getRegSizeInBits(*DstRC);
+ // TODO: This should be more aggressive, but be more cautious with very wide
+ // tuples.
unsigned NewSize = getRegSizeInBits(*NewRC);
-
- // Do not increase size of registers beyond dword, we would need to allocate
- // adjacent registers and constraint regalloc more than needed.
-
- // Always allow dword coalescing.
- if (SrcSize <= 32 || DstSize <= 32)
- return true;
-
- return NewSize <= DstSize || NewSize <= SrcSize;
+ return NewSize <= 128 || NewSize <= getRegSizeInBits(*SrcRC) ||
+ NewSize <= getRegSizeInBits(*DstRC);
}
unsigned SIRegisterInfo::getRegPressureLimit(const TargetRegisterClass *RC,
diff --git a/llvm/test/CodeGen/AMDGPU/GlobalISel/mul-known-bits.i64.ll b/llvm/test/CodeGen/AMDGPU/GlobalISel/mul-known-bits.i64.ll
index 1cd9c0bfeb7e6..85c6035e3d840 100644
--- a/llvm/test/CodeGen/AMDGPU/GlobalISel/mul-known-bits.i64.ll
+++ b/llvm/test/CodeGen/AMDGPU/GlobalISel/mul-known-bits.i64.ll
@@ -8,17 +8,16 @@ define amdgpu_kernel void @v_mul_i64_no_zext(ptr addrspace(1) %out, ptr addrspac
; GFX10-LABEL: v_mul_i64_no_zext:
; GFX10: ; %bb.0:
; GFX10-NEXT: s_load_dwordx4 s[0:3], s[4:5], 0x2c
-; GFX10-NEXT: v_lshlrev_b32_e32 v7, 3, v0
+; GFX10-NEXT: v_lshlrev_b32_e32 v6, 3, v0
; GFX10-NEXT: s_waitcnt lgkmcnt(0)
; GFX10-NEXT: s_clause 0x1
-; GFX10-NEXT: global_load_dwordx2 v[0:1], v7, s[0:1]
-; GFX10-NEXT: global_load_dwordx2 v[2:3], v7, s[2:3]
+; GFX10-NEXT: global_load_dwordx2 v[2:3], v6, s[0:1]
+; GFX10-NEXT: global_load_dwordx2 v[4:5], v6, s[2:3]
; GFX10-NEXT: s_waitcnt vmcnt(0)
-; GFX10-NEXT: v_mad_u64_u32 v[4:5], s0, v0, v2, 0
-; GFX10-NEXT: v_mad_u64_u32 v[5:6], s0, v0, v3, v[5:6]
-; GFX10-NEXT: v_mad_u64_u32 v[0:1], s0, v1, v2, v[5:6]
-; GFX10-NEXT: v_mov_b32_e32 v5, v0
-; GFX10-NEXT: global_store_dwordx2 v7, v[4:5], s[2:3]
+; GFX10-NEXT: v_mad_u64_u32 v[0:1], s0, v2, v4, 0
+; GFX10-NEXT: v_mad_u64_u32 v[1:2], s0, v2, v5, v[1:2]
+; GFX10-NEXT: v_mad_u64_u32 v[1:2], s0, v3, v4, v[1:2]
+; GFX10-NEXT: global_store_dwordx2 v6, v[0:1], s[2:3]
; GFX10-NEXT: s_endpgm
;
; GFX11-LABEL: v_mul_i64_no_zext:
@@ -26,19 +25,17 @@ define amdgpu_kernel void @v_mul_i64_no_zext(ptr addrspace(1) %out, ptr addrspac
; GFX11-NEXT: s_load_b128 s[0:3], s[4:5], 0x2c
; GFX11-NEXT: v_and_b32_e32 v0, 0x3ff, v0
; GFX11-NEXT: s_delay_alu instid0(VALU_DEP_1)
-; GFX11-NEXT: v_lshlrev_b32_e32 v9, 3, v0
+; GFX11-NEXT: v_lshlrev_b32_e32 v8, 3, v0
; GFX11-NEXT: s_waitcnt lgkmcnt(0)
; GFX11-NEXT: s_clause 0x1
-; GFX11-NEXT: global_load_b64 v[0:1], v9, s[0:1]
-; GFX11-NEXT: global_load_b64 v[2:3], v9, s[2:3]
+; GFX11-NEXT: global_load_b64 v[2:3], v8, s[0:1]
+; GFX11-NEXT: global_load_b64 v[4:5], v8, s[2:3]
; GFX11-NEXT: s_waitcnt vmcnt(0)
-; GFX11-NEXT: v_mad_u64_u32 v[4:5], null, v0, v2, 0
+; GFX11-NEXT: v_mad_u64_u32 v[0:1], null, v2, v4, 0
; GFX11-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
-; GFX11-NEXT: v_mad_u64_u32 v[6:7], null, v0, v3, v[5:6]
-; GFX11-NEXT: v_mad_u64_u32 v[7:8], null, v1, v2, v[6:7]
-; GFX11-NEXT: s_delay_alu instid0(VALU_DEP_1)
-; GFX11-NEXT: v_mov_b32_e32 v5, v7
-; GFX11-NEXT: global_store_b64 v9, v[4:5], s[2:3]
+; GFX11-NEXT: v_mad_u64_u32 v[6:7], null, v2, v5, v[1:2]
+; GFX11-NEXT: v_mad_u64_u32 v[1:2], null, v3, v4, v[6:7]
+; GFX11-NEXT: global_store_b64 v8, v[0:1], s[2:3]
; GFX11-NEXT: s_endpgm
%tid = call i32 @llvm.amdgcn.workitem.id.x()
%gep.a = getelementptr inbounds i64, ptr addrspace(1) %aptr, i32 %tid
@@ -58,18 +55,16 @@ define amdgpu_kernel void @v_mul_i64_zext_src1(ptr addrspace(1) %out, ptr addrsp
; GFX10-NEXT: s_clause 0x1
; GFX10-NEXT: s_load_dwordx4 s[0:3], s[4:5], 0x24
; GFX10-NEXT: s_load_dwordx2 s[6:7], s[4:5], 0x34
-; GFX10-NEXT: v_lshlrev_b32_e32 v2, 3, v0
-; GFX10-NEXT: v_lshlrev_b32_e32 v3, 2, v0
+; GFX10-NEXT: v_lshlrev_b32_e32 v3, 3, v0
+; GFX10-NEXT: v_lshlrev_b32_e32 v0, 2, v0
; GFX10-NEXT: s_waitcnt lgkmcnt(0)
-; GFX10-NEXT: global_load_dwordx2 v[0:1], v2, s[2:3]
-; GFX10-NEXT: global_load_dword v4, v3, s[6:7]
+; GFX10-NEXT: global_load_dwordx2 v[1:2], v3, s[2:3]
+; GFX10-NEXT: global_load_dword v4, v0, s[6:7]
; GFX10-NEXT: s_waitcnt vmcnt(0)
-; GFX10-NEXT: v_mad_u64_u32 v[2:3], s2, v0, v4, 0
-; GFX10-NEXT: v_mov_b32_e32 v0, v3
-; GFX10-NEXT: v_mad_u64_u32 v[0:1], s2, v1, v4, v[0:1]
-; GFX10-NEXT: v_mov_b32_e32 v3, v0
-; GFX10-NEXT: v_mov_b32_e32 v0, 0
-; GFX10-NEXT: global_store_dwordx2 v0, v[2:3], s[0:1]
+; GFX10-NEXT: v_mad_u64_u32 v[0:1], s2, v1, v4, 0
+; GFX10-NEXT: v_mad_u64_u32 v[1:2], s2, v2, v4, v[1:2]
+; GFX10-NEXT: v_mov_b32_e32 v2, 0
+; GFX10-NEXT: global_store_dwordx2 v2, v[0:1], s[0:1]
; GFX10-NEXT: s_endpgm
;
; GFX11-LABEL: v_mul_i64_zext_src1:
@@ -80,17 +75,16 @@ define amdgpu_kernel void @v_mul_i64_zext_src1(ptr addrspace(1) %out, ptr addrsp
; GFX11-NEXT: v_and_b32_e32 v0, 0x3ff, v0
; GFX11-NEXT: s_delay_alu instid0(VALU_DEP_1)
; GFX11-NEXT: v_lshlrev_b32_e32 v1, 3, v0
-; GFX11-NEXT: v_lshlrev_b32_e32 v2, 2, v0
+; GFX11-NEXT: v_lshlrev_b32_e32 v0, 2, v0
; GFX11-NEXT: s_waitcnt lgkmcnt(0)
-; GFX11-NEXT: global_load_b64 v[0:1], v1, s[2:3]
-; GFX11-NEXT: global_load_b32 v5, v2, s[4:5]
+; GFX11-NEXT: global_load_b64 v[1:2], v1, s[2:3]
+; GFX11-NEXT: global_load_b32 v5, v0, s[4:5]
; GFX11-NEXT: s_waitcnt vmcnt(0)
-; GFX11-NEXT: v_mad_u64_u32 v[2:3], null, v0, v5, 0
+; GFX11-NEXT: v_mad_u64_u32 v[0:1], null, v1, v5, 0
; GFX11-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
-; GFX11-NEXT: v_mov_b32_e32 v0, v3
-; GFX11-NEXT: v_mad_u64_u32 v[3:4], null, v1, v5, v[0:1]
-; GFX11-NEXT: v_mov_b32_e32 v0, 0
-; GFX11-NEXT: global_store_b64 v0, v[2:3], s[0:1]
+; GFX11-NEXT: v_mad_u64_u32 v[3:4], null, v2, v5, v[1:2]
+; GFX11-NEXT: v_dual_mov_b32 v2, 0 :: v_dual_mov_b32 v1, v3
+; GFX11-NEXT: global_store_b64 v2, v[0:1], s[0:1]
; GFX11-NEXT: s_endpgm
%tid = call i32 @llvm.amdgcn.workitem.id.x()
%gep.a = getelementptr inbounds i64, ptr addrspace(1) %aptr, i32 %tid
@@ -110,18 +104,16 @@ define amdgpu_kernel void @v_mul_i64_zext_src0(ptr addrspace(1) %out, ptr addrsp
; GFX10-NEXT: s_clause 0x1
; GFX10-NEXT: s_load_dwordx4 s[0:3], s[4:5], 0x24
; GFX10-NEXT: s_load_dwordx2 s[6:7], s[4:5], 0x34
-; GFX10-NEXT: v_lshlrev_b32_e32 v2, 2, v0
-; GFX10-NEXT: v_lshlrev_b32_e32 v3, 3, v0
+; GFX10-NEXT: v_lshlrev_b32_e32 v3, 2, v0
+; GFX10-NEXT: v_lshlrev_b32_e32 v0, 3, v0
; GFX10-NEXT: s_waitcnt lgkmcnt(0)
-; GFX10-NEXT: global_load_dword v4, v2, s[2:3]
-; GFX10-NEXT: global_load_dwordx2 v[0:1], v3, s[6:7]
+; GFX10-NEXT: global_load_dword v4, v3, s[2:3]
+; GFX10-NEXT: global_load_dwordx2 v[1:2], v0, s[6:7]
; GFX10-NEXT: s_waitcnt vmcnt(0)
-; GFX10-NEXT: v_mad_u64_u32 v[2:3], s2, v4, v0, 0
-; GFX10-NEXT: v_mov_b32_e32 v0, v3
-; GFX10-NEXT: v_mad_u64_u32 v[0:1], s2, v4, v1, v[0:1]
-; GFX10-NEXT: v_mov_b32_e32 v3, v0
-; GFX10-NEXT: v_mov_b32_e32 v0, 0
-; GFX10-NEXT: global_store_dwordx2 v0, v[2:3], s[0:1]
+; GFX10-NEXT: v_mad_u64_u32 v[0:1], s2, v4, v1, 0
+; GFX10-NEXT: v_mad_u64_u32 v[1:2], s2, v4, v2, v[1:2]
+; GFX10-NEXT: v_mov_b32_e32 v2, 0
+; GFX10-NEXT: global_store_dwordx2 v2, v[0:1], s[0:1]
; GFX10-NEXT: s_endpgm
;
; GFX11-LABEL: v_mul_i64_zext_src0:
@@ -135,14 +127,13 @@ define amdgpu_kernel void @v_mul_i64_zext_src0(ptr addrspace(1) %out, ptr addrsp
; GFX11-NEXT: v_lshlrev_b32_e32 v0, 3, v0
; GFX11-NEXT: s_waitcnt lgkmcnt(0)
; GFX11-NEXT: global_load_b32 v5, v1, s[2:3]
-; GFX11-NEXT: global_load_b64 v[0:1], v0, s[4:5]
+; GFX11-NEXT: global_load_b64 v[1:2], v0, s[4:5]
; GFX11-NEXT: s_waitcnt vmcnt(0)
-; GFX11-NEXT: v_mad_u64_u32 v[2:3], null, v5, v0, 0
+; GFX11-NEXT: v_mad_u64_u32 v[0:1], null, v5, v1, 0
; GFX11-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
-; GFX11-NEXT: v_mov_b32_e32 v0, v3
-; GFX11-NEXT: v_mad_u64_u32 v[3:4], null, v5, v1, v[0:1]
-; GFX11-NEXT: v_mov_b32_e32 v0, 0
-; GFX11-NEXT: global_store_b64 v0, v[2:3], s[0:1]
+; GFX11-NEXT: v_mad_u64_u32 v[3:4], null, v5, v2, v[1:2]
+; GFX11-NEXT: v_dual_mov_b32 v2, 0 :: v_dual_mov_b32 v1, v3
+; GFX11-NEXT: global_store_b64 v2, v[0:1], s[0:1]
; GFX11-NEXT: s_endpgm
%tid = call i32 @llvm.amdgcn.workitem.id.x()
%gep.a = getelementptr inbounds i32, ptr addrspace(1) %aptr, i32 %tid
@@ -209,18 +200,16 @@ define amdgpu_kernel void @v_mul_i64_masked_src0_hi(ptr addrspace(1) %out, ptr a
; GFX10-NEXT: s_clause 0x1
; GFX10-NEXT: s_load_dwordx4 s[0:3], s[4:5], 0x24
; GFX10-NEXT: s_load_dwordx2 s[6:7], s[4:5], 0x34
-; GFX10-NEXT: v_lshlrev_b32_e32 v2, 3, v0
+; GFX10-NEXT: v_lshlrev_b32_e32 v0, 3, v0
; GFX10-NEXT: s_waitcnt lgkmcnt(0)
; GFX10-NEXT: s_clause 0x1
-; GFX10-NEXT: global_load_dword v4, v2, s[2:3]
-; GFX10-NEXT: global_load_dwordx2 v[0:1], v2, s[6:7]
+; GFX10-NEXT: global_load_dword v3, v0, s[2:3]
+; GFX10-NEXT: global_load_dwordx2 v[1:2], v0, s[6:7]
; GFX10-NEXT: s_waitcnt vmcnt(0)
-; GFX10-NEXT: v_mad_u64_u32 v[2:3], s2, v4, v0, 0
-; GFX10-NEXT: v_mov_b32_e32 v0, v3
-; GFX10-NEXT: v_mad_u64_u32 v[0:1], s2, v4, v1, v[0:1]
-; GFX10-NEXT: v_mov_b32_e32 v3, v0
-; GFX10-NEXT: v_mov_b32_e32 v0, 0
-; GFX10-NEXT: global_store_dwordx2 v0, v[2:3], s[0:1]
+; GFX10-NEXT: v_mad_u64_u32 v[0:1], s2, v3, v1, 0
+; GFX10-NEXT: v_mad_u64_u32 v[1:2], s2, v3, v2, v[1:2]
+; GFX10-NEXT: v_mov_b32_e32 v2, 0
+; GFX10-NEXT: global_store_dwordx2 v2, v[0:1], s[0:1]
; GFX10-NEXT: s_endpgm
;
; GFX11-LABEL: v_mul_i64_masked_src0_hi:
@@ -234,14 +223,13 @@ define amdgpu_kernel void @v_mul_i64_masked_src0_hi(ptr addrspace(1) %out, ptr a
; GFX11-NEXT: s_waitcnt lgkmcnt(0)
; GFX11-NEXT: s_clause 0x1
; GFX11-NEXT: global_load_b32 v5, v0, s[2:3]
-; GFX11-NEXT: global_load_b64 v[0:1], v0, s[4:5]
+; GFX11-NEXT: global_load_b64 v[1:2], v0, s[4:5]
; GFX11-NEXT: s_waitcnt vmcnt(0)
-; GFX11-NEXT: v_mad_u64_u32 v[2:3], null, v5, v0, 0
+; GFX11-NEXT: v_mad_u64_u32 v[0:1], null, v5, v1, 0
; GFX11-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
-; GFX11-NEXT: v_mov_b32_e32 v0, v3
-; GFX11-NEXT: v_mad_u64_u32 v[3:4], null, v5, v1, v[0:1]
-; GFX11-NEXT: v_mov_b32_e32 v0, 0
-; GFX11-NEXT: global_store_b64 v0, v[2:3], s[0:1]
+; GFX11-NEXT: v_mad_u64_u32 v[3:4], null, v5, v2, v[1:2]
+; GFX11-NEXT: v_dual_mov_b32 v2, 0 :: v_dual_mov_b32 v1, v3
+; GFX11-NEXT: global_store_b64 v2, v[0:1], s[0:1]
; GFX11-NEXT: s_endpgm
%tid = call i32 @llvm.amdgcn.workitem.id.x()
%gep.a = getelementptr inbounds i64, ptr addrspace(1) %aptr, i32 %tid
@@ -389,22 +377,20 @@ define amdgpu_kernel void @v_mul_i64_partially_masked_src0(ptr addrspace(1) %out
; GFX10-NEXT: s_clause 0x1
; GFX10-NEXT: s_load_dwordx4 s[0:3], s[4:5], 0x24
; GFX10-NEXT: s_load_dwordx2 s[6:7], s[4:5], 0x34
-; GFX10-NEXT: v_lshlrev_b32_e32 v4, 3, v0
+; GFX10-NEXT: v_lshlrev_b32_e32 v0, 3, v0
; GFX10-NEXT: s_waitcnt lgkmcnt(0)
; GFX10-NEXT: s_clause 0x1
-; GFX10-NEXT: global_load_dwordx2 v[0:1], v4, s[2:3]
-; GFX10-NEXT: global_load_dwordx2 v[2:3], v4, s[6:7]
+; GFX10-NEXT: global_load_dwordx2 v[1:2], v0, s[2:3]
+; GFX10-NEXT: global_load_dwordx2 v[3:4], v0, s[6:7]
; GFX10-NEXT: s_waitcnt vmcnt(1)
-; GFX10-NEXT: v_and_b32_e32 v6, 0xfff00000, v0
+; GFX10-NEXT: v_and_b32_e32 v5, 0xfff00000, v1
; GFX10-NEXT: s_waitcnt vmcnt(0)
-; GFX10-NEXT: v_mad_u64_u32 v[4:5], s2, v6, v2, 0
-; GFX10-NEXT: v_mov_b32_e32 v0, v5
-; GFX10-NEXT: v_mad_u64_u32 v[5:6], s2, v6, v3, v[0:1]
-; GFX10-NEXT: v_and_b32_e32 v0, 0xf00f, v1
-; GFX10-NEXT: v_mad_u64_u32 v[0:1], s2, v0, v2, v[5:6]
-; GFX10-NEXT: v_mov_b32_e32 v5, v0
-; GFX10-NEXT: v_mov_b32_e32 v0, 0
-; GFX10-NEXT: global_store_dwordx2 v0, v[4:5], s[0:1]
+; GFX10-NEXT: v_mad_u64_u32 v[0:1], s2, v5, v3, 0
+; GFX10-NEXT: v_mad_u64_u32 v[4:5], s2, v5, v4, v[1:2]
+; GFX10-NEXT: v_and_b32_e32 v1, 0xf00f, v2
+; GFX10-NEXT: v_mad_u64_u32 v[1:2], s2, v1, v3, v[4:5]
+; GFX10-NEXT: v_mov_b32_e32 v2, 0
+; GFX10-NEXT: global_store_dwordx2 v2, v[0:1], s[0:1]
; GFX10-NEXT: s_endpgm
;
; GFX11-LABEL: v_mul_i64_partially_masked_src0:
@@ -414,24 +400,22 @@ define amdgpu_kernel void @v_mul_i64_partially_masked_src0(ptr addrspace(1) %out
; GFX11-NEXT: s_load_b64 s[4:5], s[4:5], 0x34
; GFX11-NEXT: v_and_b32_e32 v0, 0x3ff, v0
; GFX11-NEXT: s_delay_alu instid0(VALU_DEP_1)
-; GFX11-NEXT: v_lshlrev_b32_e32 v2, 3, v0
+; GFX11-NEXT: v_lshlrev_b32_e32 v0, 3, v0
; GFX11-NEXT: s_waitcnt lgkmcnt(0)
; GFX11-NEXT: s_clause 0x1
-; GFX11-NEXT: global_load_b64 v[0:1], v2, s[2:3]
-; GFX11-NEXT: global_load_b64 v[2:3], v2, s[4:5]
+; GFX11-NEXT: global_load_b64 v[1:2], v0, s[2:3]
+; GFX11-NEXT: global_load_b64 v[3:4], v0, s[4:5]
; GFX11-NEXT: s_waitcnt vmcnt(1)
-; GFX11-NEXT: v_and_b32_e32 v7, 0xfff00000, v0
+; GFX11-NEXT: v_and_b32_e32 v7, 0xfff00000, v1
; GFX11-NEXT: s_waitcnt vmcnt(0)
; GFX11-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
-; GFX11-NEXT: v_mad_u64_u32 v[4:5], null, v7, v2, 0
-; GFX11-NEXT: v_mov_b32_e32 v0, v5
-; GFX11-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(SKIP_1) | instid1(VALU_DEP_1)
-; GFX11-NEXT: v_mad_u64_u32 v[5:6], null, v7, v3, v[0:1]
-; GFX11-NEXT: v_and_b32_e32 v3, 0xf00f, v1
-; GFX11-NEXT: v_mad_u64_u32 v[0:1], null, v3, v2, v[5:6]
+; GFX11-NEXT: v_mad_u64_u32 v[0:1], null, v7, v3, 0
+; GFX11-NEXT: v_mad_u64_u32 v[5:6], null, v7, v4, v[1:2]
+; GFX11-NEXT: v_and_b32_e32 v4, 0xf00f, v2
; GFX11-NEXT: s_delay_alu instid0(VALU_DEP_1)
-; GFX11-NEXT: v_dual_mov_b32 v5, v0 :: v_dual_mov_b32 v0, 0
-; GFX11-NEXT: global_store_b64 v0, v[4:5], s[0:1]
+; GFX11-NEXT: v_mad_u64_u32 v[1:2], null, v4, v3, v[5:6]
+; GFX11-NEXT: v_mov_b32_e32 v2, 0
+; GFX11-NEXT: global_store_b64 v2, v[0:1], s[0:1]
; GFX11-NEXT: s_endpgm
%tid = call i32 @llvm.amdgcn.workitem.id.x()
%gep.a = getelementptr inbounds i64, ptr addrspace(1) %aptr, i32 %tid
diff --git a/llvm/test/CodeGen/AMDGPU/GlobalISel/mul.ll b/llvm/test/CodeGen/AMDGPU/GlobalISel/mul.ll
index 637aaf7529364..81a8cdc12359d 100644
--- a/llvm/test/CodeGen/AMDGPU/GlobalISel/mul.ll
+++ b/llvm/test/CodeGen/AMDGPU/GlobalISel/mul.ll
@@ -740,12 +740,12 @@ define i96 @v_mul_i96(i96 %num, i96 %den) {
; GCN-NEXT: v_mov_b32_e32 v6, v0
; GCN-NEXT: v_mov_b32_e32 v7, v1
; GCN-NEXT: v_mad_u64_u32 v[0:1], s[4:5], v6, v5, 0
-; GCN-NEXT: v_mad_u64_u32 v[8:9], s[4:5], v7, v4, v[0:1]
-; GCN-NEXT: v_mad_u64_u32 v[0:1], s[4:5], v6, v3, 0
-; GCN-NEXT: v_mad_u64_u32 v[8:9], s[4:5], v2, v3, v[8:9]
-; GCN-NEXT: v_mov_b32_e32 v2, v8
+; GCN-NEXT: v_mov_b32_e32 v8, v3
+; GCN-NEXT: v_mad_u64_u32 v[9:10], s[4:5], v7, v4, v[0:1]
+; GCN-NEXT: v_mad_u64_u32 v[0:1], s[4:5], v6, v8, 0
+; GCN-NEXT: v_mad_u64_u32 v[2:3], s[4:5], v2, v8, v[9:10]
; GCN-NEXT: v_mad_u64_u32 v[1:2], s[4:5], v6, v4, v[1:2]
-; GCN-NEXT: v_mad_u64_u32 v[1:2], s[4:5], v7, v3, v[1:2]
+; GCN-NEXT: v_mad_u64_u32 v[1:2], s[4:5], v7, v8, v[1:2]
; GCN-NEXT: s_setpc_b64 s[30:31]
;
; GFX10-LABEL: v_mul_i96:
@@ -753,26 +753,26 @@ define i96 @v_mul_i96(i96 %num, i96 %den) {
; GFX10-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
; GFX10-NEXT: v_mov_b32_e32 v6, v0
; GFX10-NEXT: v_mov_b32_e32 v7, v1
+; GFX10-NEXT: v_mov_b32_e32 v8, v3
; GFX10-NEXT: v_mul_lo_u32 v0, v6, v5
-; GFX10-NEXT: v_mad_u64_u32 v[8:9], s4, v7, v4, v[0:1]
-; GFX10-NEXT: v_mad_u64_u32 v[0:1], s4, v6, v3, 0
-; GFX10-NEXT: v_mad_u64_u32 v[8:9], s4, v2, v3, v[8:9]
-; GFX10-NEXT: v_mov_b32_e32 v2, v8
+; GFX10-NEXT: v_mad_u64_u32 v[9:10], s4, v7, v4, v[0:1]
+; GFX10-NEXT: v_mad_u64_u32 v[0:1], s4, v6, v8, 0
+; GFX10-NEXT: v_mad_u64_u32 v[2:3], s4, v2, v8, v[9:10]
; GFX10-NEXT: v_mad_u64_u32 v[1:2], s4, v6, v4, v[1:2]
-; GFX10-NEXT: v_mad_u64_u32 v[1:2], s4, v7, v3, v[1:2]
+; GFX10-NEXT: v_mad_u64_u32 v[1:2], s4, v7, v8, v[1:2]
; GFX10-NEXT: s_setpc_b64 s[30:31]
;
; GFX11-LABEL: v_mul_i96:
; GFX11: ; %bb.0:
; GFX11-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
; GFX11-NEXT: v_dual_mov_b32 v6, v0 :: v_dual_mov_b32 v7, v1
+; GFX11-NEXT: v_dual_mov_b32 v8, v2 :: v_dual_mov_b32 v9, v3
; GFX11-NEXT: v_mul_lo_u32 v0, v6, v5
-; GFX11-NEXT: v_mad_u64_u32 v[8:9], null, v7, v4, v[0:1]
-; GFX11-NEXT: v_mad_u64_u32 v[0:1], null, v6, v3, 0
-; GFX11-NEXT: v_mad_u64_u32 v[9:10], null, v2, v3, v[8:9]
-; GFX11-NEXT: v_mov_b32_e32 v2, v9
+; GFX11-NEXT: v_mad_u64_u32 v[10:11], null, v7, v4, v[0:1]
+; GFX11-NEXT: v_mad_u64_u32 v[0:1], null, v6, v9, 0
+; GFX11-NEXT: v_mad_u64_u32 v[2:3], null, v8, v9, v[10:11]
; GFX11-NEXT: v_mad_u64_u32 v[1:2], null, v6, v4, v[1:2]
-; GFX11-NEXT: v_mad_u64_u32 v[1:2], null, v7, v3, v[1:2]
+; GFX11-NEXT: v_mad_u64_u32 v[1:2], null, v7, v9, v[1:2]
; GFX11-NEXT: s_setpc_b64 s[30:31]
;
; GFX12-LABEL: v_mul_i96:
@@ -783,16 +783,16 @@ define i96 @v_mul_i96(i96 %num, i96 %den) {
; GFX12-NEXT: s_wait_bvhcnt 0x0
; GFX12-NEXT: s_wait_kmcnt 0x0
; GFX12-NEXT: v_dual_mov_b32 v6, v0 :: v_dual_mov_b32 v7, v1
-; GFX12-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
-; GFX12-NEXT: v_mul_lo_u32 v0, v6, v5
-; GFX12-NEXT: v_mad_co_u64_u32 v[8:9], null, v7, v4, v[0:1]
-; GFX12-NEXT: v_mad_co_u64_u32 v[0:1], null, v6, v3, 0
+; GFX12-NEXT: v_mov_b32_e32 v8, v3
; GFX12-NEXT: s_delay_alu instid0(VALU_DEP_2) | instskip(NEXT) | instid1(VALU_DEP_1)
-; GFX12-NEXT: v_mad_co_u64_u32 v[8:9], null, v2, v3, v[8:9]
-; GFX12-NEXT: v_mov_b32_e32 v2, v8
+; GFX12-NEXT: v_mul_lo_u32 v0, v6, v5
+; GFX12-NEXT: v_mad_co_u64_u32 v[9:10], null, v7, v4, v[0:1]
+; GFX12-NEXT: s_delay_alu instid0(VALU_DEP_3) | instskip(NEXT) | instid1(VALU_DEP_2)
+; GFX12-NEXT: v_mad_co_u64_u32 v[0:1], null, v6, v8, 0
+; GFX12-NEXT: v_mad_co_u64_u32 v[2:3], null, v2, v8, v[9:10]
; GFX12-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
; GFX12-NEXT: v_mad_co_u64_u32 v[1:2], null, v6, v4, v[1:2]
-; GFX12-NEXT: v_mad_co_u64_u32 v[1:2], null, v7, v3, v[1:2]
+; GFX12-NEXT: v_mad_co_u64_u32 v[1:2], null, v7, v8, v[1:2]
; GFX12-NEXT: s_setpc_b64 s[30:31]
;
; GFX1250-LABEL: v_mul_i96:
@@ -1071,18 +1071,18 @@ define i128 @v_mul_i128(i128 %num, i128 %den) {
; GFX7-NEXT: v_mov_b32_e32 v9, v1
; GFX7-NEXT: v_mad_u64_u32 v[0:1], s[4:5], v8, v6, 0
; GFX7-NEXT: v_mov_b32_e32 v10, v2
-; GFX7-NEXT: v_mul_lo_u32 v7, v8, v7
-; GFX7-NEXT: v_mad_u64_u32 v[11:12], s[4:5], v9, v5, v[0:1]
+; GFX7-NEXT: v_mov_b32_e32 v11, v3
+; GFX7-NEXT: v_mad_u64_u32 v[2:3], s[4:5], v9, v5, v[0:1]
; GFX7-NEXT: v_mad_u64_u32 v[0:1], s[4:5], v8, v4, 0
-; GFX7-NEXT: v_mad_u64_u32 v[11:12], s[4:5], v10, v4, v[11:12]
+; GFX7-NEXT: v_mad_u64_u32 v[2:3], s[4:5], v10, v4, v[2:3]
+; GFX7-NEXT: v_mul_lo_u32 v7, v8, v7
; GFX7-NEXT: v_mul_lo_u32 v6, v9, v6
-; GFX7-NEXT: v_mov_b32_e32 v2, v11
; GFX7-NEXT: v_mad_u64_u32 v[1:2], vcc, v8, v5, v[1:2]
; GFX7-NEXT: v_mad_u64_u32 v[1:2], s[4:5], v9, v4, v[1:2]
-; GFX7-NEXT: v_addc_u32_e64 v7, s[4:5], v12, v7, s[4:5]
-; GFX7-NEXT: v_addc_u32_e32 v6, vcc, v7, v6, vcc
-; GFX7-NEXT: v_mad_u64_u32 v[5:6], s[4:5], v10, v5, v[6:7]
-; GFX7-NEXT: v_mad_u64_u32 v[3:4], s[4:5], v3, v4, v[5:6]
+; GFX7-NEXT: v_addc_u32_e64 v3, s[4:5], v3, v7, s[4:5]
+; GFX7-NEXT: v_addc_u32_e32 v3, vcc, v3, v6, vcc
+; GFX7-NEXT: v_m...
[truncated]
Inline review comment on the GFX12 test diff:

```
; GFX12-NEXT: s_wait_kmcnt 0x0
; GFX12-NEXT: v_dual_mov_b32 v8, v0 :: v_dual_mov_b32 v9, v1
; GFX12-NEXT: v_mov_b32_e32 v10, v2
; GFX12-NEXT: v_dual_mov_b32 v10, v2 :: v_dual_mov_b32 v11, v3
```

nice!
Inline review comment on the CHECK test diff:

```
; CHECK-NEXT: flat_store_b64 v[0:1], v[2:3]
; CHECK-NEXT: v_dual_mov_b32 v1, 0 :: v_dual_mov_b32 v2, s1
; CHECK-NEXT: v_mov_b32_e32 v0, v1
; CHECK-NEXT: flat_store_b64 v[1:2], v[0:1]
```

good, fewer instructions and registers used!
Force-pushed from 6cab444 to 7af5764.
ping
Force-pushed from 7af5764 to 9c1ccff.
What's the rationale for the special case for 128-bit?
128-bit load/store is common, and it's still under the allocation granule |
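For context, a rough illustration of that cutoff (my framing, not from the thread): 128 bits is the span of the widest single vector load/store used throughout these tests, so tuples of that width already arise from ordinary memory operations.

```cpp
// Illustrative only: widths of the vector memory ops seen in the tests above.
constexpr unsigned DwordBits = 32;
static_assert(1 * DwordBits == 32, "global_load_dword / global_load_b32");
static_assert(2 * DwordBits == 64, "global_load_dwordx2 / global_load_b64");
static_assert(4 * DwordBits == 128, "global_load_dwordx4 / global_load_b128 -- the cutoff");
```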
rampitec left a comment: LGTM
LLVM Buildbot has detected a new failure on a builder. Full details are available at: https://lab.llvm.org/buildbot/#/builders/175/builds/28647

LLVM Buildbot has detected a new failure on a builder. Full details are available at: https://lab.llvm.org/buildbot/#/builders/185/builds/28668

LLVM Buildbot has detected a new failure on a builder. Full details are available at: https://lab.llvm.org/buildbot/#/builders/137/builds/28862

Also impacted some of our bots, e.g.: https://lab.llvm.org/buildbot/#/builders/10

Fixed by #167590

Allow widening up to 128-bit registers or if the new register class
is at least as large as one of the existing register classes.
This was artificially limiting. In particular this was doing the wrong
thing with sequences involving copies between VGPRs and AV registers.
Nearly all test changes are improvements.
The coalescer does not just widen registers out of nowhere. If it's trying
to "widen" a register, it's generally packing a register into an existing
register tuple, or in a situation where the constraints imply the wider
class anyway. 067a110 addressed the allocation failure concern by
rejecting coalescing if there are no available registers. The original
change in a4e63ea didn't include a realistic testcase to judge if
this is harmful for pressure. I would expect any issues from this to
be of garden variety subreg handling issue. We could use more dynamic
state information here if it really is an issue.
I get the best results by removing this override completely. This is
a smaller step for patch splitting purposes.