AMD 3rd Gen EPYC Milan Review: A Peak vs Per Core Performance Balanceby Dr. Ian Cutress & Andrei Frumusanu on March 15, 2021 11:00 AM EST
Disclaimer June 25th: The benchmark figures in this review have been superseded by our second follow-up Milan review article, where we observe improved performance figures on a production platform compared to AMD’s reference system in this piece.
Section by Ian Cutress
The arrival of AMD’s 3rd Generation EPYC processor family, using the new Zen 3 core, has been hotly anticipated. The promise of a new processor core microarchitecture, updates to the connectivity and new security options while still retaining platform compatibility are a good measure of an enterprise platform update, but the One True Metric is platform performance. Seeing Zen 3 score ultimate per-core performance leadership in the consumer market back in November rose expectations for a similar slam-dunk in the enterprise market, and today we get to see those results.
AMD EPYC 7003: 64 Cores of Milan
The headline numbers that AMD is promoting with the new generation of hardware is an increase in raw performance throughput of +19%, due to enhancements with the new core design. On top of this, AMD has new security features, optimizations for different memory configurations, and updated performance with the Infinity Fabric and connectivity.
Anyone looking for the shorthand specifications on the new EPYC 7003 series, known by its codename Milan, will see great familiarity with the previous generation, however this time around AMD is targeting several different design points.
Milan processors will offer up to 64 cores and 128 threads, using AMD’s latest Zen 3 cores. The processor is designed with eight chiplets of eight cores each, similar to Rome, but this time all eight cores in the chiplet are connected, enabling an effective double L3 cache design for a lower overall cache latency structure. All processors will have 128 lanes of PCIe 4.0, eight channels of memory, with most models supporting dual processor connectivity, and new options for channel memory optimization are available. All Milan processors should be drop-in compatible with Rome series platforms with a firmware update.
|AMD EPYC: Generation on Generation|
|Microarchitecture||Zen||Zen 2||Zen 3|
|Max Cores/Threads||32 / 64||64 / 128||64 / 128|
|Core Complex||4C + 8MB||4C + 16MB||8C + 32MB|
|Memory Support||8 x DDR4-2666||8 x DDR4-3200||8 x DDR4-3200|
|Memory Capacity||2 TB||4 TB||4 TB|
|PCIe||3.0 x128||4.0 x128||4.0 x128|
|Peak Power||180 W||240 W*||280 W|
|*Rome introduced 280 W for special HPC mid-cycle|
One of the highlights here is that the new generation of processors will offer 280 W models to all customers – previous generations had only 240 W models for all and then 280 W for specific HPC customers, however this time around all customers can enable those high performance parts with the new core design.
This is exemplified if we do direct top-of-stack processor comparisons:
|2P Top of Stack GA Offerings|
|uArch||Zen||Zen 2||Zen 3||Cascade|
|TDP||180 W||240 W||280 W||205 W|
|L3 Cache||64 MB||256 MB||256 MB||37.5 MB|
|PCIe||3.0 x128||4.0 x128||4.0 x128||3.0 x48|
|DDR4||8 x 2666||8 x 3200||8 x 3200||6 x 2933|
|DRAM Cap||2 TB||4 TB||4 TB||1 TB|
The new top processor for AMD is the EPYC 7763, a 64-core processor at 280 W TDP offering 2.45 GHz base frequency and 3.50 GHz boost frequency. AMD claims that this processor offers +106% performance in industry benchmarks compared to Intel’s best 2P 28-core processor, the Gold 6258R, and +17% over its previous generation 280 W version the 7H12.
Peak Performance vs Per Core Performance
One of AMD’s angles with the new Milan generation is going to be targeted performance metrics, with the company not simply going after ‘peak’ numbers, but also taking a wider view for customers that need high per-core performance as well, especially for software that is invariably per-core performance limited or licensed. With that in mind, AMD’s F-series of ‘fast’ processors is now being crystallized in the stack.
|AMD EPYC 7003 F-SeriesProcessors|
|EPYC 75F3||32 / 64||2950||4000||256
( 8 x 32 )
|EPYC 74F3||24 / 48||3200||4000||240 W||$2900|
|EPYC 73F3||16 / 32||3500||4000||240 W||$3521|
|EPYC 72F3||8 / 16||3700||4100||180 W||$2468|
These processors have the peak single threaded values of anything else in AMD’s offering, along with the full 256 MB of L3 cache, and in our results get the best scores on a per-thread basis than anything else we’ve tested for enterprise across x86 and Arm – more details in the review. The F-series processors will come at a slight premium over the others.
AMD EPYC: The Tour of Italy
The first generation of EPYC was launched in June 2017. At that time, AMD was essentially a phoenix: rising from the ashes of its former Opteron business, and with a promise to return to high-performance compute with a new processor design philosophy.
At the time, the traditional enterprise customer base were not initially convinced – AMD’s last foray into the enterprise space with a new generation of paradigm-shifting processor core, while it had successes, fell flat as AMD had to stop itself from going bankrupt. Opteron customers were left with no updates in sight at the time, and to the willingness to jump on an unknown platform from a company that had stung so many in the past was not a positive prospect for many.
At the time, AMD put out a three year roadmap, detailing its next generations and the path the company would take to overcoming the 99% market share behemoth in performance and offerings. These were seen as lofty goals, and many sat back willing to watch others take the gamble.
As the first generation Naples was launched, it offered impressive some performance numbers. It didn’t quite compete in all areas, and as with any new platform, there were some teething issues to begin. AMD kept the initial cycle to a few of its key OEM partners, before slowly broadening out the ecosystem. Naples was the first platform to offer extensive PCIe 3.0 and lots of memory support, and the platform initially targeted those storage or PCIe heavy deployments.
The second generation Rome, launched in August 2019 (+26 months) created a lot more fanfare. AMD’s newest Zen 2 core was competitive in the consumer space, and there were a number of key design changes in the SoC layout (such as moving to a NUMA flat design) that encouraged a number of skeptics to start to evaluate the platform. Such was the interest that AMD even told us that they had to be selective with which OEM platforms they were going to assist with before the official launch. Rome’s performance was good, and it scored a few high-profile supercomputer wins, but more importantly perhaps it showcased that AMD was able to execute on that roadmap back in June 2017.
That flat SoC architecture, along with the updated Zen 2 processor core (which actually borrowed elements from Zen 3) and PCIe 4.0, allowed AMD to start to compete on performance as well as simply IO, and AMD’s OEM partners have consistently been advertising Rome processors as compute platforms, often replacing two Intel 28-core processors for one AMD 64-core processor that also has higher memory support and more PCIe offerings. This also allows for compute density, and AMD was in a place where it could help drive software optimizations for its platform as well, extracting performance, but also moving to parity on the edge cases that its competitors were very optimized for. All the major hyperscalers also evaluated and deployed AMD-based offerings for their customers, as well as internally. AMD’s sticker of approval was pretty much there.
And so today AMD is continuing that tour of Italy with a trip to Milan, some +19 months after Rome. The underlying SoC layout is the same as Rome, but we have higher performance on the table, with additional security and more configuration options. The hyperscalers have already been getting the final hardware for six months for their deployments, and AMD is now in a position to help enable more OEM platforms at launch. Milan is drop-in compatible with Rome, which certainly helps, but with Milan covering more optimization points, AMD believes it is in a better position to target more of the market with high performance processors, and high per-core performance processors, than ever before.
AMD sees the launch of Milan as that third step in the roadmap that was shown back in June 2017, and validation on its ability to execute reliably for its customers but also offer above industry standard performance gains for its customers.
The next stop on the tour of Italy is Genoa, set to use AMD’s upcoming Zen 4 microarchitecture. AMD has also said that Zen 5 is in the pipeline.
AMD is launching this new generation of Milan processors approximately 19 months after the launch of Rome. In that time we have seen the launch of both Amazon Graviton2 and Ampere Altra, built on Arm’s Neoverse N1 family of cores.
|Milan Top-of-Stack Competition|
|TDP||280 W||?||250 W||205 W|
|L3 Cache||256 MB||32 MB||32 MB||37.5 MB|
|PCIe||4.0 x128||?||4.0 x128||3.0 x48|
|DDR4||8 x 3200||8 x 3200||8 x 3200||6 x 2933|
|DRAM Cap||4 TB||?||4 TB||1 TB|
From Intel, the company has divided its efforts between big socket and little socket configurations. For big sockets (4+) there is Cooper Lake, a Skylake derivative for select customers only. For smaller socket configurations (1-2), Intel is set to launch its 10nm Ice Lake portfolio at some point this year, but as yet it still remains silent on exact dates. To that end, all we have to compare Milan to is Intel’s Cascade Lake Xeon Scalable platform, which was the same platform we compared Rome to.
Interesting times for sure.
For this review, AMD gave us remote access to several identical servers with different processor configurations. We focused our efforts on the top-of-the-stack EPYC 7763, a 280 W 64-core processor, the EPYC 7713, a 225 W 64-core processor, and the EPYC 7F53, a 280 W 32-core processor designed as the halo Milan processor for per-core performance.
On the next page we will go through AMD’s Milan processor stack, and its comparison to Rome as well as the comparison to current Intel offerings. We then go through our test systems, discussions about our SoC structure testing (cache, core-to-core, bandwidth), processor power, and then into our full benchmarks.
- This Page, The Overview
- Milan Processor Offerings
- Test Bed Setups, Compiler Options
- Topology, Memory Subsystem and Latency
- Processor Power: Core vs IO
- SPEC: Multi-Thread Performance
- SPEC: Single-Thread Performance
- SPEC: Per Core Performance Win for 75F3
- SPECjbb MultiJVM: Java Performance
- Compilation and Compute Benchmarks
- Conclusions and End Remarks
These pages can be accessed by clicking the links, or by using the drop down menu below.
Post Your CommentPlease log in or sign up to comment.
View All Comments
lejeczek - Monday, March 15, 2021 - linkBut those Altra Q80-33 ... gee guys. I have been thinking for a while - next upgrade of the stack in the rack might as well be...
mode_13h - Monday, March 15, 2021 - linkWell, if it does well on the benchmarks that align with your workload, then I'd certainly consider at least a single-CPU Altra. IIRC, the multi-CPU interconnect was one of its weak points. You could even go dual-CPU, if you're provisioning VMs that fit on a single CPU (or better yet, just one quadrant).
Pinn - Monday, March 15, 2021 - linkWhen does this filter to the Threadrippers?
mode_13h - Monday, March 15, 2021 - linkProbably either when demand for the 3000-series Threadrippers starts slipping or if/when the supply of top-binned Zen3 dies ever catches up.
It would be interesting to see what performance could be extracted from these CPUs, if AMD would raise the power/thermal limit another 100 W. Maybe the 5000-series TR Pro will be our chance to find out!
mode_13h - Monday, March 15, 2021 - linkSomeone please remind me why Altra's memory performance is so much stronger. Is it simply down to avoiding the cache write-miss penalty? I'm pretty sure x86 CPUs long-ago added store buffers to fix that, but I can't think of any other explanation for that incredible stream benchmark discrepancy!
Andrei Frumusanu - Monday, March 15, 2021 - linkIt's due to the Neoverse N1 cores being able to dynamically transform arbitrary memory writes into non-temporal write streams instead of doing regular RFO before a write as the x86 systems are currently doing. I explain it more in the Altra review:
mode_13h - Monday, March 15, 2021 - linkThat's more or less what I recall, but do you know it's *truly* emitting non-temporal stores? Those partially-bypass some or all of the cache hierarchy (I seem to recall that the Pentium 4 actually just restricted them to one set of L2 cache). It would seem to me that implausibly deep analysis would be needed for the CPU to determine that the core in question wouldn't access the data before it was replaced. And that's not even to speak of determining whether code running on *other* cores might need it.
On the other hand, if it simply has enough write-buffering, it could avoid fetching the target cacheline by accumulating enough adjacent stores to determine that the entire cacheline would be overwritten. Of course, the downside would be a tiny bit more write latency, and memory-ordering constraints (esp. for x86) might mean that it'd only work for groups of consecutive stores to consecutive addresses.
I guess a way to eliminate some of those restrictions would be to observe through analysis of the instruction stream that a group of stores would overwrite the cacheline and then issue an allocation instead of a fetch. Maybe that's what Altra is doing?
Andrei Frumusanu - Tuesday, March 16, 2021 - linkYou're over-complicating things. The core simply sees a stream pattern and switches over to nontemporal writes. They can fully saturate the memory controller when doing just pure write patterns.
mode_13h - Wednesday, March 17, 2021 - linkBut, do you know they're truly non-temporal writes? As I've tried to explain, there are ways to avoid the write-miss penalty without using true non-temporal writes.
And how much of that are you inferring vs. basing this on what you've been told from official or unofficial sources?
Andrei Frumusanu - Saturday, March 20, 2021 - linkIt's 100% non-temporal writes, confirmed by both hardware tests and architects.