Butting in with a slightly deeper explanation, because many people understandably don't see why x86 can't just copy the M1 here:

On 64-bit ARM every instruction is 4 bytes long: nice, consistent and easy. Start a decode every 4 bytes and each one should land on a valid instruction.
1/? https://twitter.com/JamesDSneed/status/1346090236720394240
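A minimal sketch of why fixed-width decode is easy to parallelize. The function name and the 32-byte fetch window are assumptions for illustration, not details of any specific core:

```python
def fixed_width_starts(window_len, width=4):
    """Return the start offset of every instruction in a fetch window,
    assuming every instruction is exactly `width` bytes long."""
    return list(range(0, window_len, width))

# Every start offset is known up front, without inspecting any instruction
# bytes, so each decoder can be handed its own slot in the same cycle.
starts = fixed_width_starts(32)  # hypothetical 32-byte window
```

With 4-byte instructions, a 32-byte window always holds exactly eight instruction slots, and none of the decoders has to wait on another to find its starting byte.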
On x86, instructions can be 1 to 15 bytes long. The only way x86 can push throughput beyond ~4 instructions/cycle is to attempt a decode at every single byte of a 16-byte window, then invalidate the attempts that started inside an earlier instruction.
2/?
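Roughly, the "decode at every byte, then throw away the overlaps" idea looks like the sketch below. The length rule in `toy_length` is a made-up encoding for illustration, not real x86 decoding, and the names are hypothetical:

```python
def toy_length(window, offset):
    """Hypothetical length rule (NOT real x86): the low 4 bits of the
    first byte give the instruction length, clamped to 1..15 bytes."""
    return max(1, window[offset] & 0x0F)

def speculative_decode(window, first_start=0):
    # Step 1: attempt a decode at every byte offset. Hardware would do
    # all of these at once; here it is just a list comprehension.
    lengths = [toy_length(window, i) for i in range(len(window))]

    # Step 2: follow the chain of real instruction starts. Any attempt
    # that began inside an earlier instruction never gets visited,
    # which models it being invalidated.
    valid = []
    pos = first_start
    while pos < len(window) and pos + lengths[pos] <= len(window):
        valid.append(pos)
        pos += lengths[pos]
    return lengths, valid

# A 16-byte window holding toy instructions of length 3, 1, 5, 2, 4, 1.
window = bytes([0x03, 0, 0, 0x01, 0x05, 0, 0, 0, 0, 0x02, 0, 0x04, 0, 0, 0, 0x01])
lengths, valid = speculative_decode(window)
wasted = len(window) - len(valid)
```

In this toy window, 16 decode attempts yield 6 real instructions and 10 discarded attempts. Note that step 2 is inherently serial: each start depends on the length of the instruction before it, which is the part that gets harder as the decoder gets wider.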
To be fair, one x86 instruction can get more work done than one ARM instruction, so decode width doesn't map directly to performance.

Although on that note, the ~4 instructions/cycle I mentioned for x86 really only applies to simple, short instructions.
3/?
The 8-wide decode on the M1 isn't easily achieved on x86.

Trying to get there gets really complex, and with complexity comes increased power draw... and usually lower frequencies, which could offset the gains from going wider.
4/?
x86 carries a lot of ugly baggage that more modern ISAs do not; 40 years tends to teach designers a lot about what not to do in ISA design.

This thread is now going to get a bit afield from the original point, but here are my (very brief) thoughts on trying to push to wider decode:
5/?
x86 designs certainly have to get more ambitious with uop caching to scale throughput higher.
I could envision an evolved form of the dual decode engine used in Intel's Tremont as part of a path forward. Tremont's dual 3-wide decoders work independently on both sides of a branch,
6/7
but a variant of Tremont's decoders that could work on the same instruction stream, using instruction-length tags in the L1 icache, makes sense to me.
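One possible reading of the length-tagging idea, as a sketch only. The tag format, the function names, and the way work is split across two 3-wide clusters are all assumptions for illustration; nothing here describes how Tremont or any shipping core actually works:

```python
def tag_starts(lengths, first_start=0):
    """Build one start-of-instruction tag per byte of a cache line,
    given the instruction length found at each real start offset."""
    tags = [False] * len(lengths)
    pos = first_start
    while pos < len(lengths):
        tags[pos] = True
        pos += lengths[pos]
    return tags

def split_between_decoders(tags, clusters=2, width=3):
    """Hand the tagged starts to `clusters` decoders of `width` each.
    With tags stored alongside the line, no decoder has to discover
    instruction boundaries before it begins."""
    starts = [i for i, is_start in enumerate(tags) if is_start]
    per_cycle = starts[:clusters * width]
    return [per_cycle[c * width:(c + 1) * width] for c in range(clusters)]

# Per-byte lengths for a hypothetical 16-byte line; only the entries at
# real instruction starts (offsets 0, 3, 4, 9, 11, 15) matter.
lengths = [3, 1, 1, 1, 5, 1, 1, 1, 1, 2, 1, 4, 1, 1, 1, 1]
tags = tag_starts(lengths)
work = split_between_decoders(tags)
```

The serial boundary walk still happens once, when the tags are built, but on later fetches of the same line both clusters can pick up their three starts straight from the tags.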

I really wish we knew more about Tremont. It's probably more interesting as a look at the future than Zen 3 or Willow Cove, imo.
7/7
If anyone replies, please do not turn this into an Intel vs AMD argument. AMD and Intel are both working within the constraints of x86; none of this is meant as part of the generic "AMD/Intel is best" arguments, and I'm not interested in that.
You can follow @ChaoticLife13.