x86: use SSE2 to pass float and SIMD types#135408
Conversation
|
r? @SparrowLii rustbot has assigned @SparrowLii. Use |
|
Some changes occurred in compiler/rustc_codegen_gcc These commits modify compiler targets. |
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
1c0655d to
1e6dbb8
Compare
tests/assembly/x86-return-float.rs
Outdated
| //@[sse] needs-llvm-components: x86 | ||
| // We make SSE available but don't use it for the ABI. | ||
| //@[nosse] compile-flags: --target i586-unknown-linux-gnu -Ctarget-feature=+sse2 -Ctarget-cpu=pentium4 | ||
| //@[nosse] needs-llvm-components: x86 |
There was a problem hiding this comment.
Tidy is being silly and doesn't let us set the needs-llvm-components: x86 uniformly for all revisions.
| // FIXME: the MIR opt still works, but the ABI logic now introduces | ||
| // an alloca here. | ||
| // CHECK: alloca | ||
| // CHECK: store <4 x float> %x, ptr %_0, align 16 |
There was a problem hiding this comment.
I have no idea what this was trying to test, but it probably doesn't test that any more. The alloca is emitted by the ABI handling, and this test disables LLVM optimizations, so there's no way we can avoid the alloca.
It seems like this is intended to test mir-opts, but then why is it not a mir-opt test...?
There was a problem hiding this comment.
This is a regression test for an ICE in cg_ssa: d757c4b
There was a problem hiding this comment.
But why would that be a codegen test...?
There was a problem hiding this comment.
Probably the intent was that the alloca didn't come from the transmute -- since transmute used to always just make an alloca and read-write it -- but certainly if the alloca is from the ABI handling it's fine to have it there.
There was a problem hiding this comment.
@scottmcm So should I remove the test then? Or is there some way to still test what actually matters here?
There was a problem hiding this comment.
I ended up removing the test; I can't figure out how to write a test that does not contain alloca. We always emit alloca for the implicit transmutes caused by the ABI, so there's no way to reasonably test for their absence in a test that disables LLVM opts.
@scottmcm is it worth opening an issue to track improving our codegen for the implicit ABI transmutes? It would maybe use some of the same optimizations as explicit transmutes... or maybe not, I have no idea.
tests/codegen/simd/packed-simd.rs
Outdated
| // CHECK-NEXT: store <3 x float> [[VREG]], ptr [[RET_VREG]], [[RET_ALIGN]] | ||
| // CHECK-NEXT: ret void | ||
| // opt3-NEXT: ret <3 x float> [[VREG:%[a-z0-9_]+]] | ||
| // noopt: ret <3 x float> [[VREG:%[a-z0-9_]+]] |
There was a problem hiding this comment.
I have no idea if this test still makes any sense for the "noopt" revision... it seems like for some reason the call to load does not get inlined any more or so?
There was a problem hiding this comment.
OTOH the "noopt" test was already very odd before... the load <3 x float> there referred to loading the return value of load() which was returned into the alloca.
So I made this care only about opt3 for the square_packed_full part of the test.
| @@ -1,4 +1,5 @@ | |||
| //@ revisions:opt3 noopt | |||
| //@ only-x86_64 | |||
There was a problem hiding this comment.
This test is checking our LLVM ABI lowering as much as it is checking anything packed-simd specific, so it will be very hard to make this work uniformly across targets that use a different ABI lowering.
8bfa689 to
d7b63a3
Compare
| // This is a single scalar that fits into an SSE register. | ||
| // FIXME: We cannot return 128-bit-floats this way since scalars larger than | ||
| // 64bit must be returned indirectly to make cranelift happy. See the comment | ||
| // in `adjust_for_rust_abi`. |
There was a problem hiding this comment.
f128 floats are only partially supported by Cranelift, but even so returning them in a vector register should work just fine I think. Returning f128 in a vector register doesn't have the same issue that returning i128 in integer registers has as f128 fits in a single vector register, while i128 doesn't fit in a single integer register.
There was a problem hiding this comment.
The problem is that the way the "return large things indirectly" is implemented is not great, it leads to ICEs if other adjustments have already decided the ABI for one of these return types: make_indirect cannot be called if someone already called cast_to.
IMO this is a backend bug, backends should support all scalar types as return types.
This comment has been minimized.
This comment has been minimized.
e3efac5 to
e082978
Compare
This comment has been minimized.
This comment has been minimized.
|
|
||
| // Pick a target that requires no target features, so that no warning is shown | ||
| // about missing target features. | ||
| //@ compile-flags: --target arm-unknown-linux-gnueabi |
There was a problem hiding this comment.
@bjorn3 does the cranelift backend return anything for target_features_cfg? If not, there might be warnings now about missing target features, depending on the ABI info for the current target.
There was a problem hiding this comment.
Yes, it currently hard codes sse and sse2 for all x86_64 targets that are not bare-metal:
rust/compiler/rustc_codegen_cranelift/src/lib.rs
Lines 179 to 200 in 7bb9888
There was a problem hiding this comment.
tests/ui-fulldeps/codegen-backend/ doesn't actually use cg_clif. It uses the backend in tests/ui-fulldeps/codegen-backend/auxiliary/the_backend.rs. It is fine to implement target_features_cfg there as always returning sse and sse2. It doesn't compile anything to machine code anyway. It is just a test that -Zcodegen-backend with an external codegen backend functions.
There was a problem hiding this comment.
Yeah I get that. But it still seemed easiest to just use a target that doesn't require any features.
This comment has been minimized.
This comment has been minimized.
compiler/rustc_target/src/spec/targets/i686_unknown_linux_gnu.rs
Outdated
Show resolved
Hide resolved
|
(will rebase after review) |
| // x86-sse: define {{(dso_local )?}}<4 x i8> @test_UnionF32F32(float %_1) | ||
| // x86-nosse: define {{(dso_local )?}}i32 @test_UnionF32F32(float %_1) |
There was a problem hiding this comment.
wait, this also uses the byte vector, now that I look? interesting.
| // x86-nosse-LABEL: void @f128_neg({{.*}}sret([16 x i8]) | ||
| // x86-sse-LABEL: <16 x i8> @f128_neg(fp128 |
There was a problem hiding this comment.
huh...
this... feels incorrect? yet I see how it isn't. interesting.
There was a problem hiding this comment.
@tgross35 this is the correct representation, then? as a vector, to guarantee passing in xmm registers?
There was a problem hiding this comment.
From the ABI
Arguments of types __float128, _Decimal128 and __m128 are split into two halves. The least significant ones belong to class SSE, the most significant one to class SSEUP
So passing them in the same way as __m128 seems reasonable.
Not needed here, but these tests should probably be split. I originally added them to verify we do correct lowering for operations (before they were supported enough to run tests without crashing, but still reasonably useful), but didn't really intend to check the calling convention with it. So this file could become target-agnostic and drop all types from LABEL, then add a separate test to check passing and returning with extern "C".
There was a problem hiding this comment.
Our PassMode just lets us say "vector type of this size", we can't distinguish <16 x i8> from <2 x i64> from <2 x f64>. I guess we implicitly assume LLVM passes all of them the same way. So if that's not the case we have a more fundamental problem.
|
r=me after rebase. that's an interesting development in terms of how our codegen looks but it seems correct. |
|
@bors r=workingjubilee p=1 |
|
@bors r- retry |
|
Okay, this should do it. |
This comment has been minimized.
This comment has been minimized.
|
💔 Test failed - checks-actions |
|
@bors try |
This comment has been minimized.
This comment has been minimized.
|
@bors try |
This comment was marked as outdated.
This comment was marked as outdated.
|
☀️ Try build successful - checks-actions |
|
lol LLVM @bors r+ |
|
☀️ Test successful - checks-actions |
|
Finished benchmarking commit (17c1c32): comparison URL. Overall result: ✅ improvements - no action needed@rustbot label: -perf-regression Instruction countThis is the most reliable metric that we have; it was used to determine the overall result at the top of this comment. However, even this metric can sometimes exhibit noise.
Max RSS (memory usage)Results (primary -1.1%)This is a less reliable metric that may be of interest but was not used to determine the overall result at the top of this comment.
CyclesResults (secondary -9.3%)This is a less reliable metric that may be of interest but was not used to determine the overall result at the top of this comment.
Binary sizeResults (primary -0.0%)This is a less reliable metric that may be of interest but was not used to determine the overall result at the top of this comment.
Bootstrap: 775.604s -> 773.157s (-0.32%) |
This builds on the new X86Sse2 ABI landed in #137037 to actually make it a separate ABI from the default x86 ABI, and use SSE2 registers. Specifically, we use it in two ways: to return
f64values in a register rather than by-ptr, and to pass vectors of size up to 128bit in a register (or, well, whatever LLVM does when passing<4 x float>by-val, I don't actually know if this ends up in a register).Cc @workingjubilee
Fixes #133611
try-job: aarch64-apple
try-job: aarch64-gnu
try-job: aarch64-gnu-debug
try-job: test-various
try-job: x86_64-gnu-nopt
try-job: dist-i586-gnu-i586-i686-musl
try-job: x86_64-msvc-1