Migrate to new rand_core utility functions#1686
Conversation
|
I migrated As expected, the difference looks mostly within the noise threshold. The smallest block outlier is probably just a consequence of the sloppy measurement setup (x86-64 laptop without disabled frequency scaling). |
|
Thanks. This needs to be compared using the following diff against master: $ jd -r wq --git
diff --git a/Cargo.toml b/Cargo.toml
index 13aea84ed7..0064cb9a56 100644
--- a/Cargo.toml
+++ b/Cargo.toml
@@ -76,7 +76,8 @@
rand_core = { version = "0.10.0-rc-2", default-features = false }
log = { version = "0.4.4", optional = true }
serde = { version = "1.0.103", features = ["derive"], optional = true }
-chacha20 = { version = "=0.10.0-rc.5", default-features = false, features = ["rng"], optional = true }
+# chacha20 = { version = "=0.10.0-rc.5", default-features = false, features = ["rng"], optional = true }
+chacha20 = { path = "rand_chacha", optional = true, package = "rand_chacha" }
getrandom = { version = "0.3.0", optional = true }
[dev-dependencies]Running benches now. BTW you only included |
|
Some more results: Edit: updated. This was run with CPU frequency pinned to 577 MHz but still shows bad variance, so take with a big pinch of salt. |
Done. I updated the results in my previous comment. Interestingly, the result for |
|
Updated, but variance is still high. You might want to run these benches yourself. |
|
Results for Ryzen 7 2700x: |
|
Results for M4 Mac: |
|
I've been attempting to get more consistent results: https://gist.github.com/dhardy/514742f635df81053a4c2b95c32004da Can you benchmark |
|
My last run (this PR vs master over Benchmark results
Summary:
|
|
I see a 50% slower performance for I don't think we should bother with sub-3% differences. Without a careful setup I get such difference for different benchmark runs on the same code. |
Unsurprising; it's the cost of an extra check for each output. The question is whether the CPU can run this with minimal overhead; it appears that Zen 3 has lower overhead than M4 but neither is able to eliminate it. Given how few cycles are typically required to generate a word, an extra check can be significant. This is why I'm interested in rust-random/rand_core#26.
No, I consider that insignificant (or barely significant if many benches show results skewed the same way). |
Ah, I see. The old code performs the reseeding check only when cached block is exhausted. While in this PR we pay the (easily predictable) branch and decrement cost on every I am still hesitant to keep the block traits just for |
dhardy
left a comment
There was a problem hiding this comment.
Some comments on the impact of rust-random/rand_core#24 on block RNG implementations.
Summary: not enormously significant, but overall slightly negative in my (subjective) opinion (depending on one's view of the Results: Default requirement).
| const BLOCK_WORDS: u8 = 16; | ||
|
|
||
| #[repr(transparent)] | ||
| pub struct Array64<T>([T; 64]); |
There was a problem hiding this comment.
This was needed because of the bound Results: Default and [_; 64] not implementing this.
It's a nice but ultimately unimportant improvement of the new utility fns.
| impl $ChaChaXRng { | ||
| fn buffer_index(&self) -> u32 { | ||
| self.buffer[0] | ||
| } | ||
|
|
||
| fn generate_and_set(&mut self, index: usize) { | ||
| assert!(index < self.buffer.len()); | ||
| self.buffer[0] = if index != 0 { | ||
| self.core.next_block(&mut self.buffer); | ||
| index as u32 | ||
| } else { | ||
| self.buffer.len() as u32 | ||
| } | ||
| } | ||
| } |
There was a problem hiding this comment.
These fns are needed to support get/set word-pos fns. They were provided by BlockRng.
There was a problem hiding this comment.
These feel like low-level details. They very much depend on using buffer[0] as an index, which isn't something properly documented anywhere.
We could fix this with a Buffer type, but this is undesirable. We could perhaps fix this with a Buffer trait, though it wouldn't be well documented.
| fn from_seed(seed: Self::Seed) -> Self { | ||
| let core = $ChaChaXCore::from_seed(seed); | ||
| Self { | ||
| rng: BlockRng::new(core), | ||
| core: $ChaChaXCore::from_seed(seed), | ||
| buffer: le::new_buffer(), | ||
| } | ||
| } | ||
| } | ||
|
|
||
| impl RngCore for $ChaChaXRng { | ||
| #[inline] | ||
| fn next_u32(&mut self) -> u32 { | ||
| self.rng.next_u32() | ||
| let Self { core, buffer } = self; | ||
| le::next_word_via_gen_block(buffer, |block| core.next_block(block)) | ||
| } | ||
|
|
||
| #[inline] | ||
| fn next_u64(&mut self) -> u64 { | ||
| self.rng.next_u64() | ||
| let Self { core, buffer } = self; | ||
| le::next_u64_via_gen_block(buffer, |block| core.next_block(block)) | ||
| } | ||
|
|
||
| #[inline] | ||
| fn fill_bytes(&mut self, bytes: &mut [u8]) { | ||
| self.rng.fill_bytes(bytes) | ||
| fn fill_bytes(&mut self, dst: &mut [u8]) { | ||
| let Self { core, buffer } = self; | ||
| le::fill_bytes_via_gen_block(dst, buffer, |block| core.next_block(block)); | ||
| } |
There was a problem hiding this comment.
This stuff all requires marginally lower-level implementations; not very significant.
|
With the updated implementation of There is still a small performance difference when compared against |
|
Why have you updated |
|
Because IMO we should discourage it's use. Users arguably should just use a good PRNGs seeded just once. |
|
Closing based on discussion in rust-random/rand_core#31. |
| // Number of generated bytes after which to reseed `ThreadRng`. | ||
| // According to benchmarks, reseeding has a noticeable impact with thresholds | ||
| // of 32 kB and less. We choose 64 kB to avoid significant overhead. | ||
| const THREAD_RNG_RESEED_THRESHOLD: u64 = 1024 * 64; | ||
| /// Number of generated ChaCha blocks (64 bytes) after which we reseed `ThreadRng`. | ||
| /// | ||
| /// According to benchmarks, reseeding has a noticeable impact with thresholds | ||
| /// of 32 KiB and less. We choose 64 KiB (64 * 1024) to avoid significant overhead. | ||
| const THREAD_RNG_RESEED_THRESHOLD: u64 = 1024; |
There was a problem hiding this comment.
I believe this is wrong: four 16-word blocks (64 words total) are generated at once, but block_pos still refers to the 16-word block counter, hence this should be 4096.
There was a problem hiding this comment.
No, ChaCha generates 64 byte blocks. The RNG implementation generates [u32; 64], i.e. 4 blocks or 256 bytes.
There was a problem hiding this comment.
I know that. But the block counter gets incremented by 4 every time blocks are generated: https://github.com/RustCrypto/stream-ciphers/blob/master/chacha20/src/backends/soft.rs#L51
There was a problem hiding this comment.
You use THREAD_RNG_RESEED_THRESHOLD to compare against value returned by block_pos, which tracks number of generated stream cipher blocks, not how many times the Generate::generate method was called.
There was a problem hiding this comment.
block_pos still refers to the 16-word block counter, hence this should be 4096
We declare that reseeding happens after 64 KiB, not after 64K words.
There was a problem hiding this comment.
You use
THREAD_RNG_RESEED_THRESHOLDto compare against value returned byblock_pos, which tracks number of generated stream cipher blocks, not how many times theGenerate::generatemethod was called.
But it's still this counter, right?
We declare that reseeding happens after 64 KiB, not after 64K words.
Aha; 4-byte-words is the factor I was missing.
There was a problem hiding this comment.
But it's still this counter, right?
What do you mean by "this counter"? Both THREAD_RNG_RESEED_THRESHOLD and get_block_pos work with the same "unit", i.e. number of generated stream cipher blocks.
Depends on rust-random/rand_core#24