As an industry, we are slowly moving into an era where how we package the small pieces of silicon together is just as important as the silicon itself. New ways to connect that silicon include placing dies side by side or stacking them on top of each other, along with all sorts of fancy interconnects that keep the cost and design benefits of chiplets while taking advantage of the tighter integration. Today, AMD is showcasing its next packaging uplift: stacked L3 cache on its Zen 3 chiplets, bumping each chiplet from 32 MiB to 96 MiB. This announcement, however, is targeting its large EPYC enterprise processors.

AMD’s current offering in this market is its third-generation EPYC 7003 processor line, also known as Milan, which offers up to 64 Zen 3 cores across eight TSMC 7nm chiplets, co-packaged with a central IO die built on GlobalFoundries 14nm. The IO die provides eight channels of DDR4-3200 memory and 128 lanes of PCIe 4.0, along with other functionality such as security. Today’s announcement, or reveal (or acknowledgement?), is that in Q1 AMD is going to launch Milan-X.

Milan-X is an upgraded version of Milan using the stacked L3 cache packaging technology. A 64-core version of Milan today, with eight 8-core chiplets, has 256 MiB of total L3 cache; the Milan-X version adds L3 cache to each of those chiplets, creating a processor with 768 MiB of total L3 cache, unrivalled by anything else in the industry. The extra L3 cache die is built on a cache-density-optimized variant of TSMC N7, measures 36 mm², and puts an added 64 MiB on top of the 32 MiB that is already there. The rest of the chiplet has a shim built around it to help with thermal transfer.
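As a quick sanity check on those cache totals, here is a minimal sketch of the arithmetic (plain Python, using only the figures quoted above):

```python
# Cache totals for Milan vs Milan-X, using the figures from the article.
CHIPLETS = 8
BASE_L3_MIB = 32       # L3 per Zen 3 chiplet today
STACKED_L3_MIB = 64    # added by the stacked SRAM die

milan_l3 = CHIPLETS * BASE_L3_MIB                        # 8 * 32 = 256 MiB
milan_x_l3 = CHIPLETS * (BASE_L3_MIB + STACKED_L3_MIB)   # 8 * 96 = 768 MiB
print(f"Milan:   {milan_l3} MiB total L3")
print(f"Milan-X: {milan_x_l3} MiB total L3")
```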

Given AMD’s disclosures about its stacked cache technology back in June at Computex, we had already been expecting consumer and enterprise variants to come to market at some point – AMD promised the technology would come to Zen 3 and be in production by the end of 2021, and today’s announcement confirms that timeline. As and when it will come to the consumer product line is still unannounced. That being said, today’s announcement is still lacking in explicit details.

AMD confirms that Milan-X will be socket compatible with current Milan processors (that’s the SP3 socket), but hasn’t listed any details about power, frequency, or pricing. We expect the extra L3 cache to consume some power, so if we are working to the same 280 W limit, that would imply a small frequency loss. Beyond that, using an effective +45% of 7nm silicon per chiplet (36 mm² for the top cache die, 80.7 mm² for the bottom core die) should theoretically increase the price by +45%, if AMD is wafer limited at TSMC and wants to keep the same cost per unit of silicon area. Milan-X represents a unique offering in the x86 market with so much L3 cache on offer per chiplet, so you can imagine that AMD could charge a nice premium over regular Milan.
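To make that +45% figure explicit, here is a back-of-the-envelope version of the same reasoning (plain Python; the constant cost-per-area model is our assumption, not an AMD disclosure):

```python
# Extra 7nm silicon per Milan-X chiplet, from the die sizes quoted above.
CACHE_DIE_MM2 = 36.0   # stacked cache die (cache-density-optimized N7)
CORE_DIE_MM2 = 80.7    # Zen 3 core chiplet (CCD) underneath

extra = CACHE_DIE_MM2 / CORE_DIE_MM2
print(f"Extra N7 area per chiplet: {extra:.1%}")   # ~44.6%, i.e. roughly +45%

# Assumption: if AMD is wafer limited and holds cost per mm^2 constant,
# silicon cost scales linearly with area.
print(f"Relative silicon cost: {1 + extra:.2f}x")  # ~1.45x a regular chiplet
```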

We are told that those details will come closer to launch, in the first three months of next year (Q1 2022). However, AMD is keen to point out that the increased cache puts less bandwidth pressure on main memory, allowing certain workloads to speed up by as much as 66% (for EDA-based RTL verification on Synopsys VCS) when comparing a 16-core Milan against a 16-core Milan-X, although the exact chiplet configuration was not disclosed.

AMD also went on to say that Microsoft will be announcing a public preview of its Azure HBv3 series VMs with Milan-X today, alongside AMD’s event, although it didn’t talk about general availability. Beyond that, there was the usual talk of major OEM partners (Dell, Lenovo, HPE, Supermicro, Cisco) adopting the new hardware in their portfolios at full launch.

24 Comments

  • webdoctors - Monday, November 8, 2021 - link

    Wow, soon the new meme will be can it cache Crysis?
  • nandnandnand - Monday, November 8, 2021 - link

    We're gonna need an L4.
  • don0301 - Tuesday, November 9, 2021 - link

    You, Sir, just won the Internet today :)
  • nandnandnand - Monday, November 8, 2021 - link

    804 MiB = 768 MiB L3 cache + 32 MiB L2 cache + 4 MiB L1 cache
  • Wilco1 - Monday, November 8, 2021 - link

    This requires 8 + 8 + 1 + 8 dies in total (8 of which are spacers). Wouldn't it be simpler and cheaper to use larger 120mm^2 chiplets or have a single L4 SRAM die on top of the IO die? N7 yields are more than good enough.
  • nandnandnand - Monday, November 8, 2021 - link

    Smaller dies = better yields. And if they do put an L4 cache on top of the I/O die it ought to be at least a few gigabytes.
  • Kevin G - Monday, November 8, 2021 - link

    There is a case for the L4 cache on the IO die to act exclusively in the domain of a memory channel. This would act as a large reorder buffer for the memory controller, optimizing read/write turnarounds and doing some prefetching based purely on local memory controller access patterns, without involving requests outside of the local socket domain. Even a small L4 cache can show a decent gain depending on the system architecture and workload. IBM did something like this for their POWER chips. And it should be noted that desktop workloads would actually be a poor fit for this.

    In mobile, there is an argument for an L4 cache to act as the system-level cache for SoC blocks that don't normally have a large dedicated cache to themselves, while the CPU/GPU blocks evolve to include their own private L3 caches.
  • Wilco1 - Tuesday, November 9, 2021 - link

    The IO die should fit at least 768MB of L4. Besides allowing DRAM optimization as Kevin mentions, all of it could be used by a single core if needed, allowing applications with a ~800MB working set to run completely from SRAM.

    Note yields on 7nm are good, and yields on SRAM dies are pretty much 100% irrespective of their size.
  • nandnandnand - Tuesday, November 9, 2021 - link

    Just use DRAM/HBM for L4.
  • Wrs - Monday, November 8, 2021 - link

    They’d have to redesign the chiplets for 120mm2 - assuming that’s 16 cores with no L3. That would almost surely include changes to the ring bus as latency scales with stops on a ring. I’m already curious about L3 latency/bandwidth in the Milan-X parts. If they’ve made so few alterations to the Zen 3 CCD, I’d begin to suspect they tripled L3 but left bandwidth unchanged. Notice the lack of bandwidth stats in the latest marketing slides. TSVs can do so much more, if only they’d rework their interconnect.

    On the L4 on IOD idea, the existing latency between each CCD and IOD is measurable. That’s not to say they can’t have both massive L3s and massive L4, but each saps the bonus of the other and the end result might not be economical.
