It’s been nearly 10 years since Arm had first announced the Armv8 architecture in October 2011, and it’s been a quite eventful decade of computing as the instruction set architecture saw increased adoption through the mobile space to the server space, and now starting to become common in the consumer devices market such as laptops and upcoming desktop machines. Throughout the years, Arm has evolved the ISA with various updates and extensions to the architecture, some important, some maybe glanced over easily.

Today, as part of Arm’s Vision Day event, the company is announcing the first details of the company’s new Armv9 architecture, setting the foundation for what Arm hopes to be the computing platform for the next 300 billion chips in the next decade.

The big question that readers will likely be asking themselves is what exactly differentiates Armv9 to Armv8 to warrant such a large jump in the ISA nomenclature. Truthfully, from a purely ISA standpoint, v9 probably isn’t an as fundamental jump as v8 was over v7, which had introduced a completely different execution mode and instruction set with AArch64, which had larger microarchitectural ramifications over AArch32 such as extended registers, 64-bit virtual address spaces and many more improvements.

Armv9 continues the usage of AArch64 as the baseline instruction set, however adds in a few very important extensions in its capabilities that warrants an increment in the architecture numbering, and probably allows Arm to also achieve a sort of software re-baselining of not only the new v9 features, but also the various v8 extensions we’ve seen released over the years.

The three new main pillars of Armv9 that Arm sees as the main goals of the new architecture are security, AI, and improved vector and DSP capabilities. Security is a very big topic for v9 and we’ll go into the new details of the new extensions and features into more depth in a bit, but getting DSP and AI features out of the way first should be straightforward.

Probably the biggest new feature that is promised with new Armv9 compatible CPUs that will be immediately visible to developers and users is the baselining of SVE2 as a successor to NEON.

Scalable Vector Extensions, or SVE, in its first implementation was announced back in 2016 and implemented for the first time in Fujitsu’s A64FX CPU cores, now powering the world’s #1 supercomputer Fukagu in Japan. The problem with SVE was that this first iteration of the new variable vector length SIMD instruction set was rather limited in scope, and aimed more at HPC workloads, missing many of the more versatile instructions which still were covered by NEON.

SVE2 was announced back in April 2019, and looked to solve this issue by complementing the new scalable SIMD instruction set with the needed instructions to serve more varied DSP-like workloads that currently still use NEON.

The benefit of SVE and SVE2 beyond addition various modern SIMD capabilities is in their variable vector size, ranging from 128b to 2048b, allowing variable 128b granularity of vectors, irrespective of what the actual hardware is running on. Purely from a view of vector processing and programming, it means that a software developer would only ever have to compile his code once, and if in the future a CPU would come out with say native 512b SIMD execution pipelines, the code would be able to already take advantage of the full width of the units. Similarly, the same code would be able to run on more conservative designs with a lower hardware execution width capability, which is important to Arm as they design CPUs from IoT, to mobile, to datacentres. It also does this all whilst remaining within the 32b encoding space of the Arm architecture, whereas alternative implementations such as on x86 have to add on new extensions and instructions depending on vector size.

Machine learning is also seen as an important part of Armv9 as Arm sees more and more ML workloads to become common place in the next years. Running ML workloads on dedicated accelerators naturally will still be a requirement for anything that is performance or power efficiency critical, however there still will be vast new adoption of smaller scope ML workloads that will run on CPUs.

Matrix multiplication instructions are key here and will represent an important step in seeing larger adoption across the ecosystem as being a baseline feature of v9 CPUs.

Generally, I see SVE2 as probably the most important factor that would warrant the jump to a v9 nomenclature as it’s a more definitive ISA feature that differentiates it from v8 CPUs in every-day usage, and that would warrant the software ecosystem to go and actually diverge from the existing v8 stack. That’s actually become quite a problem for Arm in the server space as the software ecosystem is still baselining software packages on v8.0, which unfortunately is missing the all-important v8.1 Large System Extensions.

Having the whole software ecosystem move forward and being able to assume new v9 hardware has the capability of the new architectural extensions would help push things ahead, and probably solve some of the current situation.

However v9 isn’t only about SVE2 and new instructions, it also has a very large focus on security, where we’ll be seeing some more radical changes.

Introducing the Confidential Compute Architecture
POST A COMMENT

74 Comments

View All Comments

  • mdriftmeyer - Thursday, April 1, 2021 - link

    You're deluded. The amount of Work in Clang on C/C++ should be clear these are the foundational languages. Apple made the mistake of listening to Lattner before he bailed and developed Swift. If they're smart the fully modernize ObjC and turn Swift into a training language. Reply
  • name99 - Tuesday, March 30, 2021 - link

    "The benefit of SVE and SVE2 beyond addition various modern SIMD capabilities is in their variable vector size, ranging from 128b to 2048b, allowing variable 128b granularity of vectors, irrespective of what the actual hardware is running on"

    Not you too, Andrei :-(
    This is WRONG! This increased width is a minor benefit outside a few specialized use cases. If you want to process 512 bits of vector data per cycle, you can do that today on an A14/M1 (4 wide NEON).
    The primary value of SVE/2 is the introduction of new types of instructions that are a much better match to compilers, and to non-regular algorithms.
    Variable width matters, but NOT in the sense that I can build a 128 bit or a 512 bit implementation; it matters in that I can write a single loop (without prologue or epilogue) targeting an arbitrary width array, without expensive overhead. Along with variable-width adjacent functionality like predicate and scatter/gather.
    Reply
  • eastcoast_pete - Tuesday, March 30, 2021 - link

    Could these properties of SVE2 make Armv9 designs more attractive for, for example, AV1 (and, to come AV2) video encoding?
    I could see some customer interest there, at least if hosted by AWS or Azure.
    Reply
  • name99 - Tuesday, March 30, 2021 - link

    I don't know what's involved with AV1 and AV2 encoding. With the older codecs that I do know, most of the encoding algorithm is in fact extremely regular, so there's limited win from providing support for non-regularity.
    My point is that SVE/2 is primarily a win for types of code that, today, do not benefit much from vectors. It's much less of a win for code that's already strongly improved by vectors.
    Reply
  • Andrei Frumusanu - Tuesday, March 30, 2021 - link

    That's literally what I meant, in a just less explicit way. Reply
  • Over17 - Thursday, April 8, 2021 - link

    What is the "expensive overhead" in question? If you're writing a loop which processes 4 floats, then the tail is no longer than 3 floats. Even if you manually unroll it 4x, then it's 15 floats max to process in the epilogue. For SIMD to give any benefit, you should be processing large amounts of data so even 15 scalar operations is nothing comparing to the main loop.

    If we're talking about the size of code, then it's true; the predicates in SVE2 are making the code look smaller. So the overhead is more about the maintenance costs, isn't it?
    Reply
  • eastcoast_pete - Tuesday, March 30, 2021 - link

    Maybe it's just me, but did anyone else notice that Microsoft was prominently mentioned in several slides in ARM's presentation? To me, it means that both companies are very serious about Windows on ARM, including on the server side. I guess we'll see soon enough if the custom-ARM processor MS is apparently working on has Armv9 baked into it already. I would be surprised if it doesn't.

    And, here my standard complaint about the LITTLE cores: Quo usque tandem, ARM? When will we see a LITTLE core design with out-of-order execution, so that stock ARM designs aren't 2-3 times worse anymore on Perf/W vs Apple's LITTLE cores. That does matter for smartphones, because staying on the LITTLE cores longer and more often improves battery longevity. I know today was about the next big ISA, but some mentioning of "we're working on it" would have been nice.
    Reply
  • SarahKerrigan - Tuesday, March 30, 2021 - link

    Wouldn't surprise me if the ARMv9 little core looks more like the A65 - narrow OoO. Reply
  • BillBear - Tuesday, March 30, 2021 - link

    I wonder if the Matrix Multiplication implementation currently shipping on Apple's M1 chip is just an early implementation of this new spec?

    https://medium.com/swlh/apples-m1-secret-coprocess...

    Apple certainly had an ARM v8 implementation shipping well in advance of anyone else.
    Reply
  • name99 - Thursday, April 1, 2021 - link

    Seems unlikely.
    The ARMv8.6 matrix multiply instructions use the NEON or SVE registers
    https://community.arm.com/developer/ip-products/pr...
    and so can provide limited speedup;
    the Apple scheme uses three totally new (and HUGE) registers, the X, Y, and Z registers. It runs within the CPU but "parallel" to the CPU, there are interlocks to ensure that the matrix instructions are correctly sequenced relative to the non-matrix instructions, but overall the matrix instructions run like an old-style (80s or so) coprocessor, not like part of the SIMD/fp unit.

    The Apple scheme feels very much like they appreciate it's a stop-gap solution, an experiment that's being evolved. As such they REALLY don't want you to code directly to it, because I expect they will be modifying it (changing register sizes, register layout, etc) every year, and they can hide that behind API calls, but don't want to have to deal with legacy code that uses direct AMX instructions.
    Reply

Log in

Don't have an account? Sign up now