Intel: Sapphire Rapids With 64 GB of HBM2e, Ponte Vecchio with 408 MB L2 Cache
by Dr. Ian Cutress on November 15, 2021 9:00 AM EST
This week we have the annual Supercomputing event where all the major High Performance Computing players are putting their cards on the table when it comes to hardware, installations, and design wins. As part of the event Intel is having a presentation on its hardware offerings, which discloses additional details about the next generation hardware going into the Aurora Exascale supercomputer.
Aurora is a contract that Intel has held for some time. The original scope called for a 10nm Xeon Phi based system; that idea was mothballed when Xeon Phi was scrapped, and the design has been an ever-changing landscape shaped by Intel's hardware roadmap. It was finalized a couple of years ago that the system would use Intel's Sapphire Rapids processors (the versions that come with High Bandwidth Memory) combined with new Ponte Vecchio Xe-HPC based GPU accelerators, with the target boosted from several hundred PetaFLOPs to an ExaFLOP of compute. Most recently, Intel CEO Pat Gelsinger disclosed that the Ponte Vecchio accelerator is achieving double the performance of the original disclosures, and that Aurora will be a 2+EF supercomputer when built. Intel expects to deliver the first batch of hardware to the Argonne National Laboratory by the end of the year, though this will come with a $300m write-off on Intel's Q4 financials. Intel expects to deliver the rest of the machine through 2022, as well as ramp production of the hardware for mainstream use through Q1 for a wider launch in the first half of the year.
Today we have additional details about the hardware.
On the processor side, we know that each unit of Aurora will feature two of Intel’s newest Sapphire Rapids CPUs (SPR), featuring four compute tiles, DDR5, PCIe 5.0, CXL 1.1 (not CXL.mem), and will be liberally using EMIB connectivity between the tiles. Aurora will also be using SPR with built-in High Bandwidth Memory (SPR+HBM), and the main disclosure is that SPR+HBM will offer up to 64 GB of HBM2e using 8-Hi stacks.
Based on the representations, Intel intends to use four stacks of 16 GB HBM2e for a total of 64 GB. Intel has a relationship with Micron, and Micron's HBM2e physical dimensions are in line with the representations in Intel's materials (compared to, say, Samsung or SK Hynix). Micron currently offers two versions of 16 GB HBM2E with ECC hardware: one at 2.8 Gbps per pin (358 GB/s per stack) and one at 3.2 Gbps per pin (410 GB/s per stack). Overall we're looking at a peak bandwidth between 1.43 TB/s and 1.64 TB/s, depending on which version Intel is using. Versions with HBM will use an additional four tiles to connect each HBM stack to one of SPR's chiplets.
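As a quick sanity check, the per-stack and total figures above fall straight out of the HBM2E interface width (1024 bits per stack) and the quoted per-pin rates; a minimal sketch:

```python
# Napkin math for SPR+HBM peak bandwidth, using the per-pin rates of
# Micron's two 16 GB HBM2E parts quoted above.
PINS_PER_STACK = 1024          # HBM2E interface width per stack (bits)
STACKS = 4                     # 4 x 16 GB = 64 GB total

def peak_bandwidth_gbps(gbit_per_pin: float) -> float:
    """Peak bandwidth per stack in GB/s: width * per-pin rate / 8 bits."""
    return PINS_PER_STACK * gbit_per_pin / 8

for rate in (2.8, 3.2):
    per_stack = peak_bandwidth_gbps(rate)
    total_tb = per_stack * STACKS / 1000
    print(f"{rate} Gbps/pin -> {per_stack:.0f} GB/s per stack, "
          f"{total_tb:.2f} TB/s across {STACKS} stacks")
```

(The exact products come to 358.4 and 409.6 GB/s per stack, which Micron rounds to 358 and 410 in its datasheets.)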
Based on this diagram from Intel, despite Intel stating that SPR+HBM will share a socket with traditional SPR, it’s clear that there will be versions that are not compatible. This may be an instance where the Aurora versions of SPR+HBM are tuned specifically for that machine.
On the Ponte Vecchio (PVC) side of the equation, Intel has already disclosed that a single server inside Aurora will have six PVC accelerators per two SPR processors. Each of the accelerators will be connected to the others in an all-to-all topology using the new Xe-Link protocol built into each PVC. Xe-Link supports eight devices in fully connected mode, so with Aurora needing only six, some of that connectivity can be disabled to save power. It has not been disclosed how the GPUs connect to the SPR processors, though Intel has stated that there will be a unified memory architecture between CPU and GPU.
The insight added today by Intel is that each Ponte Vecchio dual-stack implementation (the diagram Intel has shown repeatedly is two stacks side by side) will feature a total of 64 MB of L1 cache and 408 MB of L2 cache, backed by HBM2e.
408 MB of L2 cache across two stacks means 204 MB per stack. If we compare that to other hardware:
- NVIDIA A100 has 40 MB of L2 cache
- AMD’s Navi 21 has 128 MB of Infinity Cache (an effective L3)
- AMD’s CDNA2 MI250X in Frontier has 8 MB of L2 per ‘stack’, or 16 MB total
Whichever way you slice it, Intel is betting hard on having the right hierarchy of cache for PVC. Diagrams of PVC also show 4 HBM2e chips per half, which suggests that each PVC dual-stack design might have 128 GB of HBM2e. It is likely that none of them are ‘spare’ for yield purposes, as a chiplet based design allows Intel to build PVC using known good die from the beginning.
On top of this, we also get an official number as to the scale of how many Ponte Vecchio GPUs and Sapphire Rapids (+HBM) processors we need for Aurora. Back in November 2019, when Aurora was only listed as a 1EF supercomputer, I crunched some rough numbers based on Intel saying Aurora was 200 racks and making educated guesses on the layout – I got to 5000 CPUs and 15000 GPUs, with each PVC needing around 66.6TF of performance. At the time, Intel was already showing off 40 TF of performance per card on early silicon. Intel’s official numbers for the Aurora 2EF machine are:
18,000+ CPUs and 54,000+ GPUs is a lot of hardware. But dividing 2 ExaFLOPs by 54,000 PVC accelerators comes to only 37 TeraFLOPs per PVC as an upper bound, and that number assumes zero performance comes from the CPUs.
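That upper bound is a two-line division, crediting the entire 2 EF to the GPUs:

```python
# Upper bound on per-GPU FP64 throughput implied by Intel's disclosed
# totals, assuming the CPUs contribute nothing to the 2 EF figure.
AURORA_FP64_FLOPS = 2e18       # 2 ExaFLOPs
GPU_COUNT = 54_000             # disclosed "54000+" Ponte Vecchio GPUs

tf_per_gpu = AURORA_FP64_FLOPS / GPU_COUNT / 1e12
print(f"Upper bound: ~{tf_per_gpu:.1f} TF FP64 per PVC")
```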
To add into the mix: Intel CEO Pat Gelsinger said only a couple of weeks ago that PVC was coming in at double the performance originally expected, allowing Aurora to be a 2EF machine. Does that mean the original performance target for PVC was ~20 TF of FP64? Apropos of nothing, AMD's recent MI250X announcement last week showcased a dual-GPU chip with 47.9 TF of FP64 vector performance, rising to 95.7 TF of FP64 matrix performance. The end result here might be that AMD's MI250X actually offers higher raw performance than PVC; however, AMD requires 560 W for that card, whereas Intel's power numbers have not been disclosed. We can do some napkin math here as well.
- Frontier uses 560 W MI250X cards, and is rated for 1.5 ExaFLOPs of FP64 Vector at 30 MW of power. This means Frontier needs ~31,300 cards (1.5 EF / 47.9 TF) to meet its performance target, and for each 560 W MI250X card, Frontier has allocated ~958 Watts of power (30 MW / 31,300 cards). That is a 71% overhead per card (covering cooling, storage systems, other compute/management, etc).
- Aurora uses PVC at an unknown power, and is rated for 2 ExaFLOPs of FP64 Vector at 60 MW of power. We know that Aurora has 54,000+ PVC cards to meet its performance target, which means the system has allocated ~1111 W (60 MW / 54,000) per card to cover the PVC accelerator and all other overheads. If we assume (a big assumption, I know) that Frontier and Aurora have similar overheads, then we're looking at ~650 W per PVC.
- This would end up with PVC at ~650 W for 37 TF, against MI250X at 560 W for 47.9 TF.
- This raw comparison ignores the specific features each card has for its use case.
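The bullets above reduce to a short script. All inputs are disclosed figures; the one real assumption is that Aurora's non-GPU power overhead ratio matches Frontier's (exact division gives ~1111 W per GPU slot and roughly 650 W per PVC):

```python
# Napkin math: estimate per-PVC power from system-level disclosures,
# assuming Aurora's overhead ratio matches Frontier's.
frontier_power_w = 30e6        # 30 MW system budget
frontier_fp64    = 1.5e18      # 1.5 EF FP64 vector
mi250x_fp64      = 47.9e12     # FP64 vector per MI250X
mi250x_card_w    = 560

cards = frontier_fp64 / mi250x_fp64              # ~31,300 MI250X cards
per_card_budget = frontier_power_w / cards       # ~958 W allocated per card
overhead = per_card_budget / mi250x_card_w       # ~1.71x

aurora_budget_w = 60e6 / 54_000                  # ~1111 W per PVC slot
pvc_estimate_w = aurora_budget_w / overhead      # ~650 W per PVC
print(f"{cards:.0f} cards, {per_card_budget:.0f} W/slot on Frontier, "
      f"overhead {overhead:.2f}x, PVC estimate {pvc_estimate_w:.0f} W")
```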
Compute GPU Accelerator Comparison

| Product | Ponte Vecchio | MI250X | A100 80GB |
|---|---|---|---|
| Transistors | 100 B | 58.2 B | 54.2 B |
| Tiles (inc. HBM) | 47 | 10 | 6 + 1 spare |
| Compute Units | 128 | 2 × 110 | 108 |
| Matrix Cores | 128 | 2 × 440 | 432 |
| INT8 Tensor | ? | 383 TOPs | 624 TOPs |
| FP16 Matrix | ? | 383 TFLOPs | 312 TFLOPs |
| FP64 Vector | ? | 47.9 TFLOPs | 9.5 TFLOPs |
| FP64 Matrix | ? | 95.7 TFLOPs | 19.5 TFLOPs |
| L2 / L3 Cache | 2 × 204 MB | 2 × 8 MB | 40 MB |
| VRAM Capacity | 128 GB (?) | 128 GB | 80 GB |
| VRAM Type | 8 × HBM2e | 8 × HBM2e | 5 × HBM2e |
| VRAM Bandwidth | ? | 3.2 TB/s | 2.0 TB/s |
| Chip-to-Chip Total BW | 8 links | 8 × 100 GB/s | 12 × 50 GB/s |
| CPU Coherency | Yes | With IF | With NVLink 3 |
| Process | Intel 7 + TSMC N5/N7 | TSMC N6 | TSMC N7 |
| Form Factors | OAM | OAM (560 W) | SXM4 (400 W*) |

*Some custom deployments go up to 600 W.
Intel also disclosed that it will be partnering with SiPearl to deploy PVC hardware in the European HPC efforts. SiPearl is currently building an Arm-based CPU called Rhea built on TSMC N7.
Moving forward, Intel also released a mini-roadmap. Nothing too surprising here: Intel has plans for designs beyond Ponte Vecchio, and future Xeon Scalable processors will also have HBM-enabled options.
- Intel's Aurora Supercomputer Now Expected to Exceed 2 ExaFLOPS Performance
- Intel Teases Ponte Vecchio Xe-HPC Power On, Posts Photo of Server Chip
- Analyzing Intel’s Discrete Xe-HPC Graphics Disclosure: Ponte Vecchio, Rambo Cache, and Gelato
- Intel’s 2021 Exascale Vision in Aurora: Two Sapphire Rapids CPUs with Six Ponte Vecchio GPUs
- Intel’s Xe for HPC: Ponte Vecchio with Chiplets, EMIB, and Foveros on 7nm, Coming 2021
- Bringing Geek Back: Q&A with Intel CEO Pat Gelsinger
- Intel Architecture Day 2021: A Sneak Peek At The Xe-HPG GPU Architecture
- Intel to Launch Next-Gen Sapphire Rapids Xeon with High Bandwidth Memory
- Intel’s Xeon & Xe Compute Accelerators to Power Aurora Exascale Supercomputer
- SiPearl Lets Rhea Design Leak: 72x Zeus Cores, 4x HBM2E, 4-6 DDR5
- AMD Announces Instinct MI200 Accelerator Family: Taking Servers to Exascale and Beyond
mode_13h - Thursday, November 18, 2021
> TFlops means single or double precision?
The article is repeatedly referencing fp64, which is also the standard for HPC.
> graph manipulation (pointer chasing
I don't know what sorts of things they do in HPC, but graph algorithms don't necessarily imply pointers. There are other representations that could better suit certain algorithms.
> Is Intel AMX one accelerator per core? Per cluster (of what size) of cores? One per chip?
Seems like one per core. The AMX registers are certainly per-thread, and dispatch uses CPU instructions. I know none of that is conclusive.
> Can you run the first few rounds of relaxation in half-precision
Do you mean BFloat16 or fp16? They have *very* different ranges. If you can use single-precision for some passes, then the answer for BFloat16 is probably "yes". fp16 strikes a much better balance between range and precision, but that can eliminate it from consideration if the range can't be bound to something it can represent.
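The range/precision tradeoff described here falls directly out of each format's exponent and mantissa widths (fp16: 5 exponent / 10 mantissa bits; bfloat16: 8 / 7). A pure-stdlib sketch, no bfloat16 library needed:

```python
# Derive the largest finite value and the precision at 1.0 (epsilon)
# for IEEE-style floating-point formats from their bit widths.
def format_stats(exp_bits: int, man_bits: int):
    bias = 2 ** (exp_bits - 1) - 1
    max_exp = (2 ** exp_bits - 2) - bias          # largest non-inf exponent
    largest = (2 - 2.0 ** -man_bits) * 2.0 ** max_exp
    epsilon = 2.0 ** -man_bits                    # spacing just above 1.0
    return largest, epsilon

for name, e, m in [("fp16", 5, 10), ("bfloat16", 8, 7)]:
    largest, eps = format_stats(e, m)
    print(f"{name:9s} max ~{largest:.4g}, epsilon {eps:.4g}")
```

fp16 tops out at 65504 with ~3 decimal digits of precision, while bfloat16 reaches ~3.4e38 (the same range as fp32) with only ~2 digits, which is exactly the tradeoff at issue.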
> want to know the benefits of this skew in the design, what Argonne/DoE
> expect to do with those PV's at an algorithmic level.
Some of that information might not be so hard to find.
name99 - Thursday, November 18, 2021
fp64 used to be the standard for HPC. Certain companies talking about zettascale seem to be trying to change that...
Apple AMX has per-core registers but one set of hardware.
The registers are actually stored in the co-processor, but there are four sets of them, which one used depending on the core that dispatches a particular instruction.
So "The AMX registers are certainly per-thread, and dispatch uses CPU instructions" is, like you said, not at all definitive.
I know the sort of work Argonne does (I'm a physicist!), what I don't know is how that work translates into algorithms, and how those algorithms translate into hardware desiderata.
bananaforscale - Tuesday, November 16, 2021
> "We know that PVC has 54000+ cards to meet performance targets, which means that the system has allocated 1053 W (that’s 60 MW / 54000) per card"
60 MW / 54000 is 1111 W.
Samus - Wednesday, November 17, 2021
I interned at Argonne for years back in college, where I met some of the greatest minds in my lifetime. These are the guys saving us from ourselves.
mode_13h - Thursday, November 18, 2021
Cool story.
Anti-elitism is a cancer on modern human society. Maybe the most pressing and impactful problem they could tackle is how to get the masses to believe science (while also keeping science honest).
name99 - Thursday, November 18, 2021
Cool story.
Anti-expertise is a cancer on modern human society. Unfortunately stupidity is not a problem that can be solved, not even by the greatest minds of all time.
mode_13h - Friday, November 19, 2021
Cute.
Just as one needn't be athletic to admire and respect professional athletes, neither intellect nor scientific acumen are prerequisites for respecting and heeding scientists. Of course, Dunning–Kruger presents a conundrum, but I still think the sports analogy is apt -- Dunning-Kruger can apply to athletic feats, as well.
So, you'll forgive me if I'm not so quick to write off the problem as one of stupidity. Science was once held in higher regard. It could happen again. I think it's mostly a question of preconditions.
Oxford Guy - Friday, November 19, 2021
Fraud masquerading as science is part of its image problem.
The persistence of organized delusion (religion) is another.
mode_13h - Saturday, November 20, 2021
> Fraud masquerading as science is part of its image problem.
Fair point. I think the over-selling of science is one factor that led to its fall from grace in the mid-20th century. Certainly, in more recent times, scientists tend to be notoriously cagey and abundant in their use of qualifiers, to avoid saying anything that's not well supported by data.
This is not what the public consumes, however. For quite some time, the media has rampantly over-interpreted and misinterpreted results of scientific studies, as well as over-hyping them. Then, clickbaiters, bloggers, and influencers got in on the game, taking it to a new level.
I guess my point is that public perception of science and scientists is yet another symptom of the dysfunctional media and information landscape.
That's not to let science totally off the hook. Lack of reproducibility of study results is an issue that's been coming to light, somewhat recently. One underlying cause is the incentive structure to which most researchers are subject.
There are other noted problems that have also lately garnered some attention, such as in the peer-review and gatekeeping schemes enacted by some journals and conferences.
Still, whatever internal problems science has, they're not responsible for the bulk of societal mistrust of science and scientists.
> The persistence of organized delusion (religion) is another.
I think this is a somewhat pitched battle, and not entirely necessary. There are plenty of examples where religion has come to accept science, rather than standing in opposition to it. Not least of which is the Catholic Church's acceptance of evolution and that neither the Earth nor the Sun are the center of the universe.
IMO, it's not that different from others who seek to gain advantage by pitting themselves against science. I think the issue is generally less the actual religions, and more one of their leaders.
GeoffreyA - Sunday, November 21, 2021
Internally, science seems to have many problems. I get the feeling that quantum mechanics, and the Copenhagen interpretation, occupy a despotic throne. General relativity is seemingly brushed to the sides, despite being more advanced in its concept of time, compared to QM's. And the big, missing piece in physics might well be tied to this tension in time. GR men and women come up with some innovative ideas but, seemingly, are second-class citizens. Stephen Hawking, Leonard Susskind, etc., have got a dogmatism, as if only their ideas are right (for example, string theory). And don't even talk about the journals, peer-reviewing, gatekeeping. It's a disgrace to science.
As you point out, it's not science's internal problems that have inspired popular mistrust. I would say it's partly fraud and mostly religious sentiment---and I say this as a believer and a Christian. I think, but could be wrong, that many feel science is trying to dethrone God. Of course, science's job is to dethrone falsehood only, explain Nature, and find truth.
Going further, I would say, religious sentiment doesn't go easily from man's heart; and when it's directed at science, that belief can end up being pseudo-religious. Many scientists, in their attempts to show that God is redundant or false, will accept an infinite multiverse, where in one, ours, the values turned out to be just right. I'm not qualified to debate whether that's more economical/parsimonious than God, but for my part, I don't buy it. For one, it can't be falsified, is a metaphysical explanation, and stands on the same footing as God. In any case, I'm just searching for the truth, wherever it may lead, whatever it may find.