CPU Chip Topologies

Having explained that the various cloud instances can vary quite dramatically in their hardware configurations even though on paper they have the same delivered “vCPU” count, it would be interesting to dwell a little bit more into the CPU topologies and resulting aspects such as the core-to-core latencies. Frustrated with some inflexible or inaccurate public tools on the matter, I recently had the time to write a new custom microbenchmark for testing synchronisation latencies of CPU cores, exhibiting some of the cache-coherency as well as physical layouts of current designs.

The following tables are core-to-core synchronisation latencies in nanoseconds.

Graviton1 - a1.4xlarge - Arm v8.0 ldaex/stlex

I thought that first it would be interesting to go back and showcase how Amazon’s first-generation Graviton SoC fared in this regard. Powered by Cortex-A72 cores, this design had its 16 cores arranged into 4 clusters, connected via coherent crossbar interconnect. Each cluster of 4x A72 cores had its own 2MB L2 cache, and we clearly see the faster access latencies within such a cluster in the above results. Coherency going from one cluster to the other incurred quite a high penalty, almost quadrupling the access latency.

Graviton2 - m6g.16xlarge - Arm v8.0 ldaex/stlex
Cache line Near Core 0

Moving onto the Graviton2 with its 64 N1 cores, we see quite a different setup that is comparatively a lot more uniform. This makes sense as the chip’s cores are connected via a mesh network, and the chip is a single monolithic die design. We do see some odd results in the latencies though, most particularly with the lower numbered cores having seemingly better access latencies between each other than the higher numbered cores. I didn’t quite understand this behaviour as in theory the latencies should behave more evenly across the mesh.

Experimenting around, I saw that the results weren’t consistent across runs. Changing the CPU affinity around did have some larger impact on the results, until I understood what was happening.

Graviton2 - m6g.16xlarge - Arm v8.0 ldaex/stlex
Cache line Near Core 63

In fact, what we’re seeing in the two above result sets isn’t the core-to-core two-point latency across the mesh, but rather the core-to-cache-to-core three-point latencies of the system. Amazon and Arm had confirmed one particular odd aspect of the CMN-600: cache lines are statically resident across the mesh’s cache slices. When allocating a cache line in the memory address space, this undergoes a particular hashing function which determines on which mesh cache slice it is physically homed in. When accessing this cache line from any CPU in the system, it always accesses the same physical L3 cache slice on the chip.

My particular test here consists of three primary threads: the main timing thread, and two ping-pong’ing bouncer threads which alter a flag on a cache-line allocated at the start of the test on the main thread. The behaviour which ensues in the above results table, is that when the flag cache line resides in one corner of the chip (Core pair 54 & 55 to be exact in this results set), and we have two cores on the opposite side of the chip accessing this cache line, it means the data must move forth across the whole chip and back even though it’s only being used by two cores adjacent to each other. Similarly, if two cores are synchronising on a cache line that is homed on a physically near cache slice in the mesh, the latencies will be significantly better as it has to travel much less distance.

It’s quite the odd behaviour that we don’t see quite as prevalent in other meshed cache interconnect systems such as Intel’s Xeons, although admittedly we don’t know if that’s just because we haven’t seen the same large core count implemented in such a manner, or if they have some sort of smarter cache line handling in such scenarios.

Amazon’s Graviton2 doesn’t partition the L3 mesh cache when launching lower vCPU count instances, and each instance in the system competitively shares the full 32MB cache with each other. I didn’t further spend time on this theory in my testing time, but it would be quite the odd result if you’re running some sort of high-synchronisation bottlenecked workload and the performance would vary just depending on where your shared cache line ends up on the chip relative to your active instance cores, food for thought for some database workloads.

The above results were core-to-core latencies implemented via Arm v8.0 exclusive load and stores, which isn’t really how you should do things on new chips such as the Neoverse N1 based Graviton2. As the N1 cores are v8.2 compatible, it means that they also implement the v8.1 ISA which add new instructions such as atomic compare-and-set (CAS).

Graviton2 - m6g.16xlarge - Arm v8.1 CAS

Using CAS instructions for synchronisation is massively more efficient and faster than using several sequential operations for altering a value, as the change can be done in a single atomic operation, vastly reducing the back-and-forth coherency traffic across the chip. In absolute figures, this results in latencies almost 4 times faster than the equivalent Arm v8.0 mechanism.

What’s important here in terms of takeaways, is that if you are deploying software on the new Graviton2 system and upcoming other Neoverse N1 platforms, is that you should pay very close attention to make sure your software stack is compiled against Arm v8.1 or higher (N1 is v8.2).

Currently I don’t know how many pre-compiled packages in Linux distributions even account for such scenarios as I imagine they’re all targeting the most common v8.0 baseline. Again, database applications and other synchronization heavy workloads will be most affected and it’s recommended you compile these from source – it would be an interesting performance test for AWS users who do deploy such workloads (Taking suggestions on such a real-world test setup!).

Looking at the CAS results again, besides the lower latencies, we also see a higher-level latency pattern across the mesh, with distinct separation into 16-core groups in terms of the latencies.

Essentially what I think we’re seeing here is the separation of the chip’s mesh network into “super tiles”, essentially mesh quadrants on the chip. Depending on how the data is routed on the mesh, this appears as a pattern across the latencies between the cores. It is quite odd that these results aren’t as uniform as the v8.0 results; I don’t have any technical explanation for this behaviour.

Finally, it’s again noteworthy that we’re talking about a 64-core system with a single uniform memory domain and on a single monolithic die. AMD’s Rome does provide the former, but doesn’t achieve the same uniform access latencies as the Graviton2.

Xeon Platinum 8259 (Cascade Lake) - m5n.16xlarge - x86 CAS

Looking at the Intel Xeon based instances for comparison, we see relatively uniform access across 16 cores in the same socket, with worse latencies for the CPUs in the second socket. The Graviton2’s latencies here are better than Intel in the best case, but also slightly worse in the worst case (in the same socket).

It’s to be noted that these systems have the SMT logical cores enumerated not at N+1 as in Windows, but instead first listing the cores in physical order first, and then listing the secondary SMT logical cores, hence the mirrored latencies from core 32 onwards. Noteworthy is the low logical-to-logical core latencies of 9.3ns (~3.2GHz cores), due to the coherency between such siblings happening at the L1D cache level.

Having two sockets, we’re seeing two NUMA nodes for the Xeon system, dividing up the physical memory into two virtual memory spaces and two core sets.

AMD EPYC 7571 - m5a.16xlarge - x86 CAS

Finally, the AMD EPYC system we see cores within an CCX having excellent latencies between each other, however coherency between different CCXs is slow, and even slower when accessing other dies within the socket.

The four NUMA nodes means the memory is divided into four virtual memory spaces as well as CPU groups. Again, AMD’s newer EPYC2 Rome processors get rid of this limitation, but alas weren’t able to test on AWS at this moment in time.

Amazon's Graviton2: First Neoverse N1 Chip Memory Subsystem & Latency


View All Comments

  • notladca - Tuesday, March 10, 2020 - link

    I would love to know if the product line has split within Annapurna. In other words whether Graviton2 has, like previous Annapurna SoCs, some interesting support around storage and networking for use in future Nitro. It's possible Amazon has some behind the scenes work going on with CCIX for future machines. For example integrating their Inferentia chip more closely with the SoC.

    Given the core count, it'd also be interesting to compare ML inference acceleration via fp16 and int8 dot product instructions per core vs use of GPU or Inferentia.
  • coder543 - Tuesday, March 10, 2020 - link

    One small bit of feedback: with that CPU topology chart, the coloration seems a little off. A difference of +/- 1 yields very different shades of red and orange, but the same difference on the green side of the spectrum yields no discernible difference in color? Personally, I think all of the 200 +/- 5 values in the first topology chart should be an almost uniform sea of orange/red. The important thing is the 150 difference in latency, not the +/- 1 latency, and the noise in the colors distracts the reader from the primary distinction. A lower signal to noise ratio.

    Also: what is the unit? nanoseconds? microseconds? milliseconds? I can’t figure it out, and it’s not labeled as far as I can tell.
  • Andrei Frumusanu - Tuesday, March 10, 2020 - link

    Nanoseconds, I'll add a remark. Reply
  • sing_electric - Tuesday, March 10, 2020 - link

    My tin hat is telling me to be suspicious of Amazon's pricing here. When shopping for cloud computing, perf/$ becomes VERY alluring, but I have to wonder if Amazon is willing to let its Gravitron servers be a "loss leader," artificially lowering prices to get market share until Arm on server is well-established - before then raising prices to something closer to a economically sustainable number. Reply
  • Andrei Frumusanu - Tuesday, March 10, 2020 - link

    Vertical integration is powerful. Amazon can share profits and margins division wide, not having to pay overhead to AMD/Intel. Reply
  • sing_electric - Tuesday, March 10, 2020 - link

    True, but then Amazon has to pay for the ARM license and 100% of the development/production costs. I would be very surprised if they managed to *make money* on the 1st couple Graviton generations (especially if you factor in having to buy Annapurna), since you'd need to say "of the $X generated by Graviton metal, $Y would have been spent on EC2 anyways, meaning $Z is our actual gain," and that's... probably too much to ask at this stage. Reply
  • rahvin - Tuesday, March 10, 2020 - link

    The costs you mention are nothing compared to what they pay right now with Intel or AMD with they 50% margins on top of the actual cost. IMO this initiative was born out of Intel's price increases from 2010 to now. By vertically integrating they have full control over the price structure and they have very good data on what kind of workloads are running so they can tailor the design.

    IMO it was just a question of time until Amazon tried to vertically integrate this like they've done with shipping and lots of other stuff. Bezos is following the Robber Barron growth model.
  • dotjaz - Wednesday, March 11, 2020 - link

    Huh? AMD has a gross margin of 40%, true. But keep in mind AWS has a operating margin of 30%, that mean AWS has a even higher gross margin than AMD, comparable to AMD's server department.
    Do you know what that means? For $1 of expenditure in to chip manufacturing, AWS expects to earn as much as AMD does. And since AWS don't have the volume as far as chip goes, their gross margin for chip investment will be lower, therefore not worth the investment if the decision is purely financial.

    But yes, the other point stands, AWS have better control of costing (with more leverage as well) and performance.
  • Wilco1 - Wednesday, March 11, 2020 - link

    For every $1 worth of silicon you could pay AMD $1.50, pay Intel $2 or pay TSMC $1 plus $0.20 internal development costs. Which works out best you think? Reply
  • extide - Friday, March 13, 2020 - link

    It's not that simple. AMD and Intel can spread those development costs over vastly more processors. I mean we'll never know how it truly breaks down -- but I'd imagine Amazon has figure this all out and this will be pretty profitable for them. Reply

Log in

Don't have an account? Sign up now