For this use case, you can squeeze out even more performance by using the SHA-1 implementation in Intel ISA-L Crypto [1]. The SHA-1 implementation there allows for multi-buffer hashes, giving you the ability to calculate the hashes for multiple chunks in parallel on a single core. Given that that's basically your use case, it might be worth considering. I doubt it'll provide much speedup if you're already I/O bound here, though.
[1]: https://github.com/intel/isa-l_crypto
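If it helps, usage looks roughly like the sketch below. This is untested and from memory of the sha1_mb API in that repo (SHA1_HASH_CTX_MGR, sha1_ctx_mgr_submit, HASH_ENTIRE, etc.), so double-check the names against sha1_mb.h:

/* Untested sketch of ISA-L Crypto's multi-buffer SHA-1 (sha1_mb) usage. */
#include <stdio.h>
#include <stdlib.h>
#include "sha1_mb.h"            /* from isa-l_crypto; names per my recollection */

#define NUM_CHUNKS 8
#define CHUNK_LEN  (256 * 1024)

int main(void)
{
    SHA1_HASH_CTX_MGR *mgr;
    SHA1_HASH_CTX ctxpool[NUM_CHUNKS];
    static unsigned char chunks[NUM_CHUNKS][CHUNK_LEN]; /* stand-ins for torrent pieces */

    /* The context manager wants 16-byte alignment. */
    if (posix_memalign((void **)&mgr, 16, sizeof(*mgr)))
        return 1;
    sha1_ctx_mgr_init(mgr);

    /* Submit all chunks; the manager interleaves several hashes on one core. */
    for (int i = 0; i < NUM_CHUNKS; i++) {
        hash_ctx_init(&ctxpool[i]);
        sha1_ctx_mgr_submit(mgr, &ctxpool[i], chunks[i], CHUNK_LEN, HASH_ENTIRE);
    }

    /* Drain whatever is still in flight. */
    while (sha1_ctx_mgr_flush(mgr) != NULL)
        ;

    /* Each digest comes back as five 32-bit words (byte order may need fixing
       up depending on how you want to print or compare it). */
    for (int i = 0; i < NUM_CHUNKS; i++) {
        for (int j = 0; j < SHA1_DIGEST_NWORDS; j++)
            printf("%08x", ctxpool[i].job.result_digest[j]);
        printf("\n");
    }
    free(mgr);
    return 0;
}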
Thank you, I will definitely have a look, and update the article if there’s any interesting finding
I came across this repo recently and it looks great. It's a pity there doesn't seem to be an official Ubuntu package for it, though there is one for the Intelligent Storage Acceleration Library.
SHA1 is difficult to vectorize due to a tight loop-carried dependency in the main operation. In an optimized build, I've only seen about a 15% speedup over the scalar version with x64 SSSE3 without hardware SHA1 support. Debug builds of course can benefit more from the reduction in operations since the inefficient code generation is a bigger issue there than the dependency chains. I think the performance delta is bigger for ARM64 CPUs, but it's pretty rare to not have the Crypto extension (except notably some Raspberry Pi models).
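To make the dependency concrete: in each round the freshly computed word becomes the next round's "a", so the 80 rounds form one serial chain and SIMD mostly only helps with the message schedule. Rounds 0-19 in plain C look roughly like this:

#include <stdint.h>

static inline uint32_t rotl32(uint32_t x, int n) { return (x << n) | (x >> (32 - n)); }

/* First 20 SHA-1 rounds only, to show the shape of the loop-carried
   dependency; the value t computed in round i is the a input of round i+1. */
static void sha1_rounds_0_19(uint32_t state[5], const uint32_t w[20])
{
    uint32_t a = state[0], b = state[1], c = state[2], d = state[3], e = state[4];
    for (int i = 0; i < 20; i++) {
        uint32_t f = (b & c) | (~b & d);                  /* Ch(b,c,d) */
        uint32_t t = rotl32(a, 5) + f + e + 0x5A827999u + w[i];
        e = d; d = c; c = rotl32(b, 30); b = a; a = t;    /* serial chain */
    }
    state[0] = a; state[1] = b; state[2] = c; state[3] = d; state[4] = e;
}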
The comments in the SSE2 version are a bit odd, as they reference MMX and the Pentium M and Efficeon CPUs. Those CPUs are ancient -- 2003/2004 era. The vectorized code you have also uses SSE2 and not MMX, which is important since SSE2 is double the width and has different performance characteristics from MMX. IIRC, Intel CPUs didn't start supporting SHA until ~2019 with Ice Lake, so the target for non-hardware-accelerated vectorized SHA1 for Intel CPUs would be mostly Skylake-based.
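If you're dispatching between the SHA-NI and SSE paths at runtime, the feature bit is CPUID leaf 7 (subleaf 0), EBX bit 29. With GCC or Clang the check can look like this sketch (MSVC would use __cpuidex instead):

#include <cpuid.h>   /* GCC/Clang */
#include <stdio.h>

/* CPUID.(EAX=7,ECX=0):EBX bit 29 is the SHA extensions flag. */
static int has_sha_ni(void)
{
    unsigned int eax, ebx, ecx, edx;
    if (!__get_cpuid_count(7, 0, &eax, &ebx, &ecx, &edx))
        return 0;
    return (ebx >> 29) & 1;
}

int main(void)
{
    printf("SHA-NI: %s\n", has_sha_ni() ? "yes" : "no");
    return 0;
}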
Why not just compile that particular object file with optimizations on and the rest of the project with optimizations off?
Yes, that's the obvious (and boring!) answer, that I mention in the introduction and that's in a way the implicit conclusion. But that does not teach us SIMD then :)
Your article isn't really about how to speed up a debug build, though, and I therefore think you're likely not going to find the right audience. Like, to be honest, I gave up on your article, because while I found the premise of speeding up a debug build really interesting, I (currently) have no interest in hand-optimizing SIMD... but, in another time, or if I were someone else, I might find that really interesting, but then would not have thought to look at this article. "Hand-optimizing SHA-1 using SIMD intrinsics and assembly" is just a very different mental space than "making my debug build run 100x faster", even if they are two ways to describe the same activity. "Using SIMD and assembly to avoid relying on compiler optimizations for performance" also feels better? I would at least get it if your title were a pun or a joke or was in some way fun--at which point I would blame Hacker News for pulling articles out of their context and not having a good policy surrounding publicly facing titles or subtitles--but it feels like, in this case, the title is merely a poor way to describe the content.
Could be that he did it for fun and not to reach a target audience?
If so that's perfectly fine, but I still agree with saurik--the title is rather misleading. The article is mainly about how to speed up SHA (without using compiler optimizations).
You would appreciate https://aras-p.info/blog/2024/09/14/Vector-math-library-code...
TLDR: SIMD intrinsics are great. But, their performance in debug builds is surprisingly bad.
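To make that concrete, the kind of code being measured is a pile of thin wrappers around intrinsics, something in the spirit of this (hypothetical names, not the actual library from the post):

#include <xmmintrin.h>   /* SSE */

/* Hypothetical thin wrappers of the vector-math-library flavor. */
typedef struct { __m128 v; } float4;

static inline float4 f4_add(float4 a, float4 b) { float4 r; r.v = _mm_add_ps(a.v, b.v); return r; }
static inline float4 f4_mul(float4 a, float4 b) { float4 r; r.v = _mm_mul_ps(a.v, b.v); return r; }
static inline float4 f4_madd(float4 a, float4 b, float4 c) { return f4_add(f4_mul(a, b), c); }

With optimizations on, all of that collapses into a couple of SSE instructions. At -O0 / /Od the nominally inline helpers aren't inlined, so every little operation is a real call with 16-byte structs copied through the stack, and as far as I understand that overhead (not the SIMD instructions themselves) is most of where the debug-build gap comes from.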
I am not sure I understand this and I believe it's wrong and misleading unless I am missing something obvious. Why would it be the case that hand-written SIMD would perform worse than scalar and non-autovectorized code in debug builds?
I’m not sure why compilers generate slow code for SIMD intrinsics when optimizations are disabled. But, it is observable that they do.
Aras and pretty much all gamedevs are concerned about this because they use SIMD in critical loops and debug build performance is a major concern in gamedev.
Debug build performance has been an issue forever for a variety of reasons. The most common solution is to keep certain critical systems optimized even in debug builds. And, only disable optimizations for those individual subsystems when they are the specific target of debugging. It's inconvenient for game engine devs. But, that's a small subset of the engineering team.
Why would you slap an LLM response to my question?
Dude... I'm a greybeard game engine developer. I write game engine critical loops using SIMD intrinsics. https://old.reddit.com/r/gamedev/comments/xddlp/describe_wha...
These AI slop accusations are getting redic. Is the problem that I was too thorough in my response? :P Comes from decades of explaining bespoke tech to artists.
And, the writer of the "wrong and misleading" article I linked was the Lead Graphics Programmer of Unity3D for 15 years! XD
Well, the response really sounded unnatural and LLM-y. If it's not, then please accept my apology.
I write SIMD kernels, and the conclusion drawn in the article makes no sense regardless of who wrote it. I don't doubt the observations made in the experiments, but I do doubt the hypothesis that the SIMD itself is slowing down the code.
The actual answer is in the disassembly but unfortunately it wasn't shown.
Yep, I was hoping to learn how to do this. Seems like a much better long term lesson.
For gcc: #pragma GCC optimize ("O0")
For clang: #pragma clang optimize off
For MSVC: #pragma optimize("", off)
Put one of these at the top of your source file.
Generally people want the opposite here: raising the optimization level for some code rather than turning it off for everything.
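For the raising direction: GCC can bump the level per function (or per file with its pragma) even in an -O0 translation unit; Clang generally can't go above the command-line level, so the usual workaround there is to move the hot code into its own file and give that file different flags in the build system; MSVC's #pragma optimize enables named optimizations rather than an /O level, and I'm not sure how much it buys you under /Od. A sketch:

/* Sketch: raising optimization for one hot function inside an otherwise
   unoptimized (-O0 / /Od) translation unit. */
#if defined(_MSC_VER)
#pragma optimize("gt", on)        /* "g" = global opts, "t" = favor fast code */
#endif
#if defined(__GNUC__) && !defined(__clang__)
__attribute__((optimize("O2")))   /* GCC-only; Clang ignores this attribute */
#endif
void hot_loop(float *dst, const float *src, int n)
{
    for (int i = 0; i < n; i++)
        dst[i] = src[i] * 2.0f;
}
#if defined(_MSC_VER)
#pragma optimize("", on)          /* restore the command-line settings */
#endif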
I understand the post is about learning to speed up SHA1 calculation, and I have no comment on that. However, the state file is a solved problem for me. It's rare for state files to get corrupted, and when it happens it's simple to just re-check the file. I cannot imagine a torrent client checking the hashes of TBs of files on every single start. It's no coincidence that many torrent clients have a feature to skip hash checking, assume the files are correct, and start seeding immediately.
If I were a betting person I'd bet that the dedicated SHA-1 instructions and the instruction sequence OpenSSL uses map to similar enough uops. Unsure if there's a way to check, but that's my understanding of the thousands of instructions in modern processors - mostly just assigning names to common patterns.