x86: use SSE2 to pass float and SIMD types by RalfJung · Pull Request #135408 · rust-lang/rust

RalfJung · 2025-01-12T14:10:09Z

This builds on the new X86Sse2 ABI landed in #137037 to actually make it a separate ABI from the default x86 ABI, and use SSE2 registers. Specifically, we use it in two ways: to return f64 values in a register rather than by-ptr, and to pass vectors of size up to 128 bits in a register (or, well, whatever LLVM does when passing <4 x float> by-val; I don't actually know if this ends up in a register).

Cc @workingjubilee
Fixes #133611

try-job: aarch64-apple
try-job: aarch64-gnu
try-job: aarch64-gnu-debug
try-job: test-various
try-job: x86_64-gnu-nopt
try-job: dist-i586-gnu-i586-i686-musl
try-job: x86_64-msvc-1
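
To make the first of the two changes concrete, here is a minimal sketch of the kind of function whose return path is affected. The name `half` is hypothetical and not part of the PR; the statement that its return value moves from a by-pointer return to a register is taken from the description above and is not verified here.

```rust
// Hypothetical example, not taken from the PR or its tests.
// Per the PR description, on 32-bit x86 targets that use the SSE2 ABI this
// f64 is now returned in a register instead of through a pointer.
// `#[inline(never)]` keeps the call, and thus the return convention, visible.
#[inline(never)]
pub fn half(x: f64) -> f64 {
    x * 0.5
}
```

Compiling such a function with `--emit=llvm-ir` for an i686 target and for an i586 target should show the difference in the signature, much like the `x86-sse` and `x86-nosse` revisions in this PR's tests.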

rustbot · 2025-01-12T14:10:18Z

r? @SparrowLii

rustbot has assigned @SparrowLii.
They will have a look at your PR within the next two weeks and either review your PR or reassign to another reviewer.

Use r? to explicitly pick a reviewer

rustbot · 2025-01-12T14:10:21Z

Some changes occurred in compiler/rustc_codegen_gcc

cc @antoyo, @GuillaumeGomez

These commits modify compiler targets.
(See the Target Tier Policy.)

compiler/rustc_target/src/callconv/mod.rs

RalfJung · 2025-01-12T15:04:17Z

tests/assembly/x86-return-float.rs

+//@[sse] needs-llvm-components: x86
+// We make SSE available but don't use it for the ABI.
+//@[nosse] compile-flags: --target i586-unknown-linux-gnu -Ctarget-feature=+sse2 -Ctarget-cpu=pentium4
+//@[nosse] needs-llvm-components: x86


Tidy is being silly and doesn't let us set the needs-llvm-components: x86 uniformly for all revisions.

RalfJung · 2025-01-12T15:05:52Z

tests/codegen/intrinsics/transmute-x64.rs

+    // FIXME: the MIR opt still works, but the ABI logic now introduces
+    // an alloca here.
+    // CHECK: alloca
+    // CHECK: store <4 x float> %x, ptr %_0, align 16


I have no idea what this was trying to test, but it probably doesn't test that any more. The alloca is emitted by the ABI handling, and this test disables LLVM optimizations, so there's no way we can avoid the alloca.

It seems like this is intended to test mir-opts, but then why is it not a mir-opt test...?

Cc @scottmcm (who added the test in d757c4b)

This is a regression test for an ICE in cg_ssa: d757c4b

But why would that be a codegen test...?

hmm @scottmcm can you explain?

Probably the intent was that the alloca didn't come from the transmute -- since transmute used to always just make an alloca and read-write it -- but certainly if the alloca is from the ABI handling it's fine to have it there.

@scottmcm So should I remove the test then? Or is there some way to still test what actually matters here?

I ended up removing the test; I can't figure out how to write a test that does not contain alloca. We always emit alloca for the implicit transmutes caused by the ABI, so there's no way to reasonably test for their absence in a test that disables LLVM opts.

@scottmcm is it worth opening an issue to track improving our codegen for the implicit ABI transmutes? It would maybe use some of the same optimizations as explicit transmutes... or maybe not, I have no idea.
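
For context on what the removed test was guarding: an explicit transmute like the one below is the kind of operation that, as noted above, used to be lowered through an alloca that was written and then read back. This is only a sketch; `bits_of` is a hypothetical name, and the array types merely stand in for the SIMD types used in the real test.

```rust
use std::mem::transmute;

// Hypothetical stand-in for the kind of explicit transmute the old test covered.
// Both types are 16 bytes, so the transmute only reinterprets the bits.
pub fn bits_of(x: [f32; 4]) -> [u32; 4] {
    // SAFETY: [f32; 4] and [u32; 4] have the same size, and every bit pattern
    // is a valid u32.
    unsafe { transmute::<[f32; 4], [u32; 4]>(x) }
}
```

The implicit transmutes discussed in this thread are not written by the user at all; they are introduced by the ABI handling when the value is moved into or out of its argument or return slot.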

RalfJung · 2025-01-12T15:06:54Z

tests/codegen/simd/packed-simd.rs

-    // CHECK-NEXT: store <3 x float> [[VREG]], ptr [[RET_VREG]], [[RET_ALIGN]]
-    // CHECK-NEXT: ret void
+    // opt3-NEXT: ret <3 x float> [[VREG:%[a-z0-9_]+]]
+    // noopt: ret <3 x float> [[VREG:%[a-z0-9_]+]]


I have no idea if this test still makes any sense for the "noopt" revision. It seems that, for some reason, the call to load no longer gets inlined.

OTOH the "noopt" test was already very odd before... the load <3 x float> there referred to loading the return value of load() which was returned into the alloca.

So I made this care only about opt3 for the square_packed_full part of the test.

RalfJung · 2025-01-12T15:09:13Z

tests/codegen/simd/packed-simd.rs

@@ -1,4 +1,5 @@
 //@ revisions:opt3 noopt
+//@ only-x86_64


This test is checking our LLVM ABI lowering as much as it is checking anything packed-simd specific, so it will be very hard to make this work uniformly across targets that use a different ABI lowering.

compiler/rustc_target/src/spec/mod.rs

compiler/rustc_target/src/callconv/mod.rs

bjorn3 · 2025-01-12T15:47:31Z

compiler/rustc_target/src/callconv/x86.rs

                // This is a single scalar that fits into an SSE register.
+                // FIXME: We cannot return 128-bit-floats this way since scalars larger than
+                // 64bit must be returned indirectly to make cranelift happy. See the comment
+                // in `adjust_for_rust_abi`.


f128 floats are only partially supported by Cranelift, but even so, returning them in a vector register should work just fine, I think. Returning f128 in a vector register doesn't have the same issue as returning i128 in integer registers: f128 fits in a single vector register, while i128 doesn't fit in a single integer register.

The problem is that the way "return large things indirectly" is implemented is not great: it leads to ICEs if other adjustments have already decided the ABI for one of these return types, since make_indirect cannot be called if someone already called cast_to.

IMO this is a backend bug, backends should support all scalar types as return types.
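
The ordering problem described above can be illustrated with a toy model. Everything below is hypothetical: `PassMode`, `cast_to`, and `make_indirect` echo names mentioned in this thread, `ArgAbi` is an assumed name for the per-argument record, and all of the types are simplified stand-ins. Where this sketch returns an error, the thread says the real code ICEs.

```rust
// Toy model only; these are not the rustc data structures.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum PassMode {
    Direct,   // passed or returned as-is
    Cast,     // an earlier ABI adjustment already picked a register type
    Indirect, // passed or returned through a pointer
}

#[derive(Debug)]
struct ArgAbi {
    mode: PassMode,
}

impl ArgAbi {
    // Models an earlier adjustment, such as "return this in an SSE register".
    fn cast_to(&mut self) {
        self.mode = PassMode::Cast;
    }

    // Models the later "return large scalars indirectly" step.
    fn make_indirect(&mut self) -> Result<(), String> {
        match self.mode {
            PassMode::Direct => {
                self.mode = PassMode::Indirect;
                Ok(())
            }
            // Already indirect: nothing to do.
            PassMode::Indirect => Ok(()),
            // Someone already decided the ABI; the two decisions conflict.
            PassMode::Cast => Err("make_indirect called after cast_to".to_string()),
        }
    }
}
```

In this model, the FIXME in the diff corresponds to never calling `cast_to` for 128-bit scalars, so that the later `make_indirect` step still sees a `Direct` value.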

RalfJung · 2025-01-12T17:27:17Z

tests/ui-fulldeps/codegen-backend/hotplug.rs


+// Pick a target that requires no target features, so that no warning is shown
+// about missing target features.
+//@ compile-flags: --target arm-unknown-linux-gnueabi


@bjorn3 does the cranelift backend return anything for target_features_cfg? If not, there might be warnings now about missing target features, depending on the ABI info for the current target.

Yes, it currently hard codes sse and sse2 for all x86_64 targets that are not bare-metal:

rust/compiler/rustc_codegen_cranelift/src/lib.rs

Lines 179 to 200 in 7bb9888

fn target_features_cfg(
    &self,
    sess: &Session,
    _allow_unstable: bool,
) -> Vec<rustc_span::Symbol> {
    // FIXME return the actually used target features. this is necessary for #[cfg(target_feature)]
    if sess.target.arch == "x86_64" && sess.target.os != "none" {
        // x86_64 mandates SSE2 support
        vec![sym::fsxr, sym::sse, sym::sse2]
    } else if sess.target.arch == "aarch64" {
        match &*sess.target.os {
            "none" => vec![],
            // On macOS the aes, sha2 and sha3 features are enabled by default and ring
            // fails to compile on macOS when they are not present.
            "macos" => vec![sym::neon, sym::aes, sym::sha2, sym::sha3],
            // AArch64 mandates Neon support
            _ => vec![sym::neon],
        }
    } else {
        vec![]
    }
}

tests/ui-fulldeps/codegen-backend/ doesn't actually use cg_clif. It uses the backend in tests/ui-fulldeps/codegen-backend/auxiliary/the_backend.rs. It is fine to implement target_features_cfg there as always returning sse and sse2. It doesn't compile anything to machine code anyway. It is just a test that -Zcodegen-backend with an external codegen backend functions.

Yeah I get that. But it still seemed easiest to just use a target that doesn't require any features.

compiler/rustc_target/src/spec/targets/i686_unknown_linux_gnu.rs

RalfJung · 2025-02-17T15:03:06Z

(will rebase after review)

workingjubilee · 2025-02-17T21:45:38Z

tests/codegen/union-abi.rs

+// x86-sse: define {{(dso_local )?}}<4 x i8> @test_UnionF32F32(float %_1)
+// x86-nosse: define {{(dso_local )?}}i32 @test_UnionF32F32(float %_1)


wait, this also uses the byte vector, now that I look? interesting.

workingjubilee · 2025-02-17T21:47:36Z

tests/codegen/float/f128.rs

+// x86-nosse-LABEL: void @f128_neg({{.*}}sret([16 x i8])
+// x86-sse-LABEL: <16 x i8> @f128_neg(fp128


huh...

this... feels incorrect? yet I see how it isn't. interesting.

@tgross35 this is the correct representation, then? as a vector, to guarantee passing in xmm registers?

From the ABI

Arguments of types __float128, _Decimal128 and __m128 are split into two halves. The least significant ones belong to class SSE, the most significant one to class SSEUP

So passing them in the same way as __m128 seems reasonable.

Not needed here, but these tests should probably be split. I originally added them to verify that we lower operations correctly (before they were supported enough to run tests without crashing, but still reasonably useful), and didn't really intend to check the calling convention with them. So this file could become target-agnostic and drop all types from LABEL, and a separate test could then check passing and returning with extern "C".

Our PassMode just lets us say "vector type of this size", we can't distinguish <16 x i8> from <2 x i64> from <2 x f64>. I guess we implicitly assume LLVM passes all of them the same way. So if that's not the case we have a more fundamental problem.
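
A minimal sketch of that limitation, under the assumption that the ABI layer records only a size for a vector: three differently typed 128-bit values collapse to the same description. `VectorReg` and `describe` are hypothetical names, and plain arrays stand in for LLVM vector types.

```rust
use std::mem::size_of;

// Hypothetical: a vector register description that carries a size and nothing else.
#[derive(Debug, PartialEq, Eq)]
struct VectorReg {
    size_bytes: usize,
}

// Builds the description for any type; the element type is deliberately dropped.
fn describe<T>() -> VectorReg {
    VectorReg { size_bytes: size_of::<T>() }
}
```

Here `describe::<[u8; 16]>()`, `describe::<[u64; 2]>()`, and `describe::<[f64; 2]>()` all compare equal, which mirrors why the lowering has to assume that the backend passes every 128-bit vector the same way.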

workingjubilee · 2025-02-17T21:53:42Z

r=me after rebase.

that's an interesting development in terms of how our codegen looks but it seems correct.

RalfJung · 2025-02-18T07:37:00Z

@bors r=workingjubilee p=1
(so many conflicts)

bors · 2025-02-18T07:37:02Z

📌 Commit c3ae562 has been approved by workingjubilee

It is now in the queue for this repository.

bors · 2025-02-18T07:43:59Z

⌛ Testing commit c3ae562 with merge b6dc8cf...

RalfJung · 2025-02-18T07:47:58Z

@bors r- retry
There's still something wrong with tests/codegen/abi-x86-sse.rs.

RalfJung · 2025-02-18T07:55:40Z

Okay, this should do it.
@bors r=workingjubilee

bors · 2025-02-18T07:55:42Z

📌 Commit b31637b has been approved by workingjubilee

It is now in the queue for this repository.

bors · 2025-02-18T12:44:14Z

⌛ Testing commit b31637b with merge 1446a51...

bors · 2025-02-18T14:22:49Z

💔 Test failed - checks-actions

RalfJung · 2025-02-18T14:39:35Z

@bors try

bors · 2025-02-18T14:40:48Z

⌛ Trying commit 69f17af with merge 43830f0...

RalfJung · 2025-02-18T15:12:00Z

@bors try

bors · 2025-02-18T15:13:13Z

⌛ Trying commit 803feb5 with merge 0937552...

bors · 2025-02-18T18:11:04Z

☀️ Try build successful - checks-actions
Build commit: 0937552 (09375525af5bce535584f59b98b0e97eaf5c49ed)

workingjubilee · 2025-02-18T18:41:37Z

lol LLVM

@bors r+

bors · 2025-02-18T18:41:40Z

📌 Commit 803feb5 has been approved by workingjubilee

It is now in the queue for this repository.

bors · 2025-02-19T01:25:05Z

⌛ Testing commit 803feb5 with merge 17c1c32...

bors · 2025-02-19T04:36:19Z

☀️ Test successful - checks-actions
Approved by: workingjubilee
Pushing 17c1c32 to master...

rust-timer · 2025-02-19T05:54:01Z

Finished benchmarking commit (17c1c32): comparison URL.

Overall result: ✅ improvements - no action needed

@rustbot label: -perf-regression

Instruction count

This is the most reliable metric that we have; it was used to determine the overall result at the top of this comment. However, even this metric can sometimes exhibit noise.

	mean	range	count
Regressions ❌ (primary)	-	-	0
Regressions ❌ (secondary)	-	-	0
Improvements ✅ (primary)	-	-	0
Improvements ✅ (secondary)	-0.3%	[-0.3%, -0.3%]	1
All ❌✅ (primary)	-	-	0

Max RSS (memory usage)

Results (primary -1.1%)

This is a less reliable metric that may be of interest but was not used to determine the overall result at the top of this comment.

	mean	range	count
Regressions ❌ (primary)	-	-	0
Regressions ❌ (secondary)	-	-	0
Improvements ✅ (primary)	-1.1%	[-1.1%, -1.1%]	1
Improvements ✅ (secondary)	-	-	0
All ❌✅ (primary)	-1.1%	[-1.1%, -1.1%]	1

Cycles

Results (secondary -9.3%)

This is a less reliable metric that may be of interest but was not used to determine the overall result at the top of this comment.

	mean	range	count
Regressions ❌ (primary)	-	-	0
Regressions ❌ (secondary)	-	-	0
Improvements ✅ (primary)	-	-	0
Improvements ✅ (secondary)	-9.3%	[-9.3%, -9.3%]	1
All ❌✅ (primary)	-	-	0

Binary size

Results (primary -0.0%)

This is a less reliable metric that may be of interest but was not used to determine the overall result at the top of this comment.

	mean	range	count
Regressions ❌ (primary)	-	-	0
Regressions ❌ (secondary)	-	-	0
Improvements ✅ (primary)	-0.0%	[-0.0%, -0.0%]	16
Improvements ✅ (secondary)	-	-	0
All ❌✅ (primary)	-0.0%	[-0.0%, -0.0%]	16

Bootstrap: 775.604s -> 773.157s (-0.32%)
Artifact size: 360.33 MiB -> 360.31 MiB (-0.01%)

rustbot assigned SparrowLii Jan 12, 2025

rustbot added the O-apple (Operating system: Apple / Darwin (macOS, iOS, tvOS, visionOS, watchOS)), S-waiting-on-review (Status: Awaiting review from the assignee but also interested parties.), and T-compiler (Relevant to the compiler team, which will review and decide on the PR/issue.) labels Jan 12, 2025

RalfJung force-pushed the x86-sse2 branch 2 times, most recently from 1c0655d to 1e6dbb8 Compare January 12, 2025 15:03

Noratrieb reviewed Jan 12, 2025

compiler/rustc_target/src/spec/mod.rs (outdated, resolved)

bjorn3 reviewed Jan 12, 2025

compiler/rustc_target/src/callconv/mod.rs (outdated, resolved)

bjorn3 reviewed Jan 12, 2025

compiler/rustc_target/src/callconv/mod.rs (outdated, resolved)

RalfJung force-pushed the x86-sse2 branch 2 times, most recently from 8bfa689 to d7b63a3 Compare January 12, 2025 15:29

bjorn3 reviewed Jan 12, 2025

RalfJung force-pushed the x86-sse2 branch from a52a670 to 2789b51 Compare January 12, 2025 15:54

RalfJung force-pushed the x86-sse2 branch 2 times, most recently from e3efac5 to e082978 Compare January 12, 2025 16:33

RalfJung force-pushed the x86-sse2 branch from e082978 to 3eb1f47 Compare January 12, 2025 17:24

workingjubilee reviewed Jan 12, 2025

compiler/rustc_target/src/spec/targets/i686_unknown_linux_gnu.rs (outdated, resolved)

workingjubilee reviewed Feb 17, 2025

x86-sse2 ABI: use SSE registers for floats and SIMD

803feb5

moxian mentioned this pull request Mar 27, 2025

pclmulqdq intrinsics don't inline well across target_feature changes anymore #139029

Open

BoxyUwU mentioned this pull request Apr 21, 2025

Draft release notes for 1.87 #140133

Closed

RalfJung mentioned this pull request May 20, 2025

x86 (32/64): go back to passing SIMD vectors by-ptr #141309

Merged

tgross35 mentioned this pull request Jun 17, 2025

32x performance regression for AVX2 intrinsics in Rust v1.87 #142603

Closed

roryharr mentioned this pull request Nov 4, 2025

PoH Performance degradation with rust 1.87+ anza-xyz/agave#8869

Closed
