Great article.
Anand, could you please clarify something:
I had the impression that the PPE was an SMT processor in the sense that it had to be executing 2 threads in order to issue 2 instructions per clock. In other words, I didn't think the PPE control logic could decide to issue 2 instructions from the same thread at any given clock tick, but rather that it absolutely needed an instruction from each thread to issue two instructions.
After reading the article I no longer assume my impression is right, but a comment from you would be nice.
Come to think of it, what I described is essentially 2 separate single-threaded in-order cores. :-)
Real concurrency is hard for programmers. It's a real pain to get right and it's hard to debug. Systematic analysis just gets too complex: there are too many states, you end up with a huge graph/Markov model, and it's just impossible to solve tractably.
Superscalar and SMT just try to increase ILP at the CPU level without burdening the programmer or compiler writer. However, we've pretty much come to the end of getting a CPU to go faster: at 5GHz, LIGHT travels 6cm between clocks, and an electrical signal travels slower still. As it is, in the P4 pipeline there are at least 2 stages which are simply there to allow signals to propagate across the chip. Clearly, going faster in Hz isn't going to make the pipeline go faster.
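That 6cm figure is easy to sanity-check with a few lines of C (a back-of-envelope calculation, nothing more):

#include <stdio.h>

int main(void) {
    const double c = 3.0e8;   /* speed of light in m/s */
    const double f = 5.0e9;   /* clock frequency in Hz */

    /* the farthest any signal could possibly travel in one clock period */
    printf("at %.0f GHz, light covers %.1f cm per cycle\n",
           f / 1e9, (c / f) * 100.0);
    return 0;
}

This prints "at 5 GHz, light covers 6.0 cm per cycle"; a real electrical signal in real wires covers noticeably less.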
So the ONLY thing they can do now is to put lots of cores on the same chip, and then we're going to have to deal with real concurrency. IBM/Sony are doing it now with CELL and Intel will do it in a few years. It's going to happen regardless. What we need is languages which can support real concurrency. The Java Memory Model is an almost ideal fit for the CELL, but other aspects don't work out so well, maybe. We need pi-calculus/join-calculus constructs in languages to be able to really deal with these CPUs efficiently.
Your comments about CELL not being general purpose enough are a little wrong. IBM /already/ has the CELL in workstations and is evaluating applications that will work well. Given the speed of the interconnect and the fact that it is cache-coherent, I think we'll be seeing supercomputers based on many CELLs; it's an almost ideal fit (as it is, you've almost got ccNUMA on a single chip). Also, bear in mind that this is IBM's 5th (or 6th?) generation of SMT in the PPE - they've been at it MUCH longer than Intel. IBM started in the mid-90s, around the same time that the Alpha crew were working on the EV8, which was going to have 8-way thread-level parallelism (got canned, sadly).
Also, if you look at IBM's heavy CPUs: the POWER5 has SMT and dispatches in groups of 8 instructions, not the 3/4 that AMD/Intel manage.
What I'm saying here is that, sure, the SPEs don't have BPTs or BTBs, and they're all 2-way dispatch and not greater; but they all run REALLY fast, they have short pipelines (so the pain of a branch misprediction won't be so bad), and IBM has had software branch prediction available since the POWER4, so they've been at it a few years and must have decided that compilers really can successfully predict branch directions.
Backwards compatibility doesn't matter. Sure, Microsoft took several years to support AMD64, but that didn't stop take-up of the platform - everyone just ran Linux on it (well, everyone who wanted to use the 64-bit CPU they'd bought). We'll only have to wait a few months after the CELL is out before Linux can be built on it. 100 quid says Microsoft will never support it.
Frankly, considering that it's far more likely to go into supercomputer or workstation environments, no one there gives a damn about backwards compatibility or Windows support. No one in those environments /wants/ a damn paper clip.
#14: Replace 'lazy developers' with 'developers on a budget' and you will have a true statement. It's not an issue of laziness, it's an issue of having the budget to optimize fully for a platform.
Great article Anand!! Yeah, I actually get to bring my Comp150 knowledge to bear in reading this article! If this had come out 6 months ago I would have been totally lost. It will indeed be interesting to see what headway Cell can make; unfortunately, as Anand alludes to, the x86 architecture is just too heavily entrenched for anything to budge it except the Big 2 (AMD and Intel). I can't wait to see what type of power the PlayStation 3 will have, though, and especially how that power will be utilized in games. I bet there will be some jaw-dropping graphics awaiting us there. That is, if Cell's limitations don't hold back lazy game developers and lead to a string of mediocre games punctuated by a few amazing titles made by independent developers who really care to utilize the architecture. Didn't the PlayStation 1 suffer something similar?
The Real World Technologies article on the Cell states that it gives up single-thread performance in favour of running many parallel threads. That sounds like a terribly difficult processor to develop games for.
I for one think it would be easier to put the burden on the hardware rather than on the software side.
Can we see another repeat of the PS2? Technically impressive, but hard to code for.
11 - I think the point is that games tend to use certain functions of a CPU much more frequently, while general business/office applications make use of a wider range of generic operations. I understand your complaint, as office applications generally don't need a lot more power than about 1.5 GHz at most. However, the key part of the statement was "general purpose microprocessor" and not "very powerful".
"Performance in business/office applications requires a very powerful, very fast general purpose microprocessor, but performance in a game console, for example, does not."
WHAT??????? Hello?? So an office app like Word needs a very powerful processor, but a game console does not? I beg to differ. I suppose it depends on how you define "business/office application", but I think that statement is WAY off. I know several current office applications that will limp along on a Pentium 133, but no current game has any hope on the same CPU.
It was clear to me that it meant console CPUs don't have to be as general-purpose and brute-force powerful in every regard - they can get away with being more specialized, and suck at general work, but still be fast for game-specific code.
Interesting stuff. The PlayStation has always been something of a pain in the rear to program. PS1 went its own way, and PS2 did the same. PS3 and Cell seem ready to pave new roads into the "OMG this is really complex" land of programming. I'm glad I've given up serious programming... :)
Sweet article! Way over my head, but there were some parts that were dropped down to my level of understanding. Leave it to Anand to tell the real story. It will be interesting to see how willing some companies will be to accommodate Sony's radical processor... but as long as there's money... Do you think it would be possible (down the road) to drop an x86 chip in place of the PPE? Wouldn't that make the Cell compatible with current processing standards?
Describing this as a "sit down read" type of article makes me want to print it out to put it in the magazine rack, because I don't have a laptop + 802.11g to peruse AnandTech while I'm, er... ;)
In "Decode", each row has 2 columns. What do First and Second Column mean ? same as "Write" And in "Execute, each row has 3 columns. What do First, Second and Third column mean ? And how is the process ? (The current table is about "In-Order Issue with Out-of-Order Completion").
I've read it many times, in the "Instruction Level Parallelism". But I still don't have any idea about it.
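For what it's worth, in diagrams of that kind the columns are usually parallel slots: two "Decode" columns means the machine can decode 2 instructions per cycle, and three "Execute" columns means 3 functional units. A generic sketch (my own reconstruction, not necessarily the article's exact table) of in-order issue with out-of-order completion, where i2 is a slow 3-cycle operation:

cycle | decode  | execute          | write
  1   | i1  i2  |                  |
  2   | i3  i4  | i1   i2a         |
  3   |         | i3   i2b   i4    | i1
  4   |         |      i2c         | i3  i4
  5   |         |                  | i2

Issue stays in program order (i1 and i2 enter execute before i3 and i4 do), but i3 and i4 write their results back before the slower i2 finishes: that is the out-of-order completion.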
In "Decode", each row has 2 columns. What do First and Second Column mean ? same as "Write" And in "Execute, each row has 3 columns. What do First, Second and Third column mean ? And how is the process ? (The current table is about "In-Order Issue with Out-of-Order Completion"). I've read it many times, in the "Instruction Level Parallelism". But I still don't have any idea about it.
70 Comments
PhilAnd - Wednesday, October 5, 2005 - link
Thank you SO MUCH!!! I've been looking for an explanation of the Cell forever and this did it perfectly!! THANK YOU!!! YOU ARE GOD!!!
philpoe - Sunday, July 31, 2005 - link
Under the high-level overview of the Cell section, the PPE has 64KB L1 and 512KB L2 cache. On the other hand, under the on-die memory controller section, we see that the XDR memory gives bandwidth of 25.6GB/sec, and the integrated memory controller "significantly reduces memory latencies".
My question then is: what good are the L1 and L2 caches doing? Given the amount of real estate those transistors take up, isn't it more economical to use the system RAM exclusively? The L2 cache takes up about the same amount of space as an SPE (not that it would help all that much to put one more on the die), but what effect on performance would getting rid of the L2 or even L1 cache have, with memory of such high bandwidth?
tipoo - Wednesday, December 2, 2015 - link
L1 and L2 latency isn't even approached by the fastest system RAM latencies, XDR included. A few nanoseconds versus on the order of a hundred.
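The difference is easy to demonstrate. Below is a minimal C sketch of a pointer-chasing loop (illustrative only, not a rigorous benchmark): each load depends on the previous one, so raw bandwidth cannot hide anything, and with a working set far larger than the caches every iteration pays a full round trip to DRAM.

#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define N (1 << 22)            /* 4M entries: far larger than any cache */
#define ITERS 10000000L

int main(void) {
    size_t *next = malloc(N * sizeof *next);
    for (size_t i = 0; i < N; i++) next[i] = i;

    /* Sattolo's algorithm: a random single-cycle permutation, so the
       chain visits every slot and defeats the hardware prefetchers */
    for (size_t i = N - 1; i > 0; i--) {
        size_t j = rand() % i;
        size_t t = next[i]; next[i] = next[j]; next[j] = t;
    }

    clock_t t0 = clock();
    size_t p = 0;
    for (long i = 0; i < ITERS; i++)
        p = next[p];           /* serially dependent loads */
    clock_t t1 = clock();

    printf("%.1f ns per dependent load (p=%zu)\n",
           (double)(t1 - t0) / CLOCKS_PER_SEC * 1e9 / ITERS, p);
    free(next);
    return 0;
}

Shrink N until the table fits in L1 and the same loop runs an order of magnitude faster; that gap is what the caches are buying, 25.6GB/sec of bandwidth notwithstanding.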
jiulemoigt - Saturday, March 26, 2005 - link
Oh #59, it's funnier than that: the PPE does all the work a modern CPU does with logic, and the easy stuff is done by the extra procs... but that means the messy stuff (think calc equations) can not be done by the extra procs, so if your game requires more abstract equations versus simple math (say AI versus drawing boxes and cubes), your machine will be dependent on the smaller proc.
Pipeline length is a game of balancing prediction against speed: if you can keep a full pipeline predicted correctly, a longer pipe is much faster; but if you miss the prediction at some point in the pipe, everything after that point is lost, so the longer the pipe is after the miss, the bigger the loss. A shorter pipe is not necessarily better, as there are tasks the P4 excels at because it has the huge pipe, and the longer the pipe, the higher you can scale the proc speed. That is why Intel chose such a huge pipe, knowing the misses would hurt, but at the time people still wanted every MHz possible. AMD has a 14-stage pipe because they use decent prediction but better register use, as well as fast-pathing.
The biggest reason x86 is fast, though, is that there are reams of code out there that already work; a new system will require human hours to clean things up so that it can take all the shortcuts that x86 already takes. So if the devs are laughing now, it's because they know it is going to be very unfriendly to code for, and they're frustrated that hardware with years of design effort behind it is not being designed to be easier to code for and to do the hard work for us. Instead it does all the easy work faster, which doesn't help us, and makes the hard work harder! And in some cases it runs slower, because that was cheaper. I understand how much money M$ lost, which was passed on to nVidia, so they won't get away with that this time; they will have to make it cheaper this time around.
AndyKH - Thursday, March 24, 2005 - link
#55
Regarding the interview from GameSpot:
He (the guy who is very upset about having to program for in-order cores) states that code will run very crappily on these new cores. Well... I don't know exactly how many pipeline stages the new cores have, but they will without a doubt have a LOT fewer stages than a modern out-of-order core. If you also spend a great amount of design effort to make sure the branch target is calculated very early in the pipeline, and couple that with a high clock frequency, you might not even need to fetch your bag of Kleenexes to dry your eyes.
Of course, I don't know how long the pipeline in a Cell PPE or in the Xenon's cores is, but everything points to a very short one. Also I don't know how early the branch target is calculated, but I bet it's pretty early.
As an end remark I might add that "computer engineers are not stupid people". In the interview, the guy makes it sound like it will be impossible to run gameplay code on the new console CPUs..... I personally don't think that IBM's and Sony's engineers would design a CPU with so little care.
Regards
Andreas
TheGee - Monday, March 21, 2005 - link
Transputer, anyone? The computer on a chip that could be massively paralleled? Difficult to program, but this Cell is not such a great leap in ideas; still, with the corporate weight behind it, it may succeed where others have failed and break the x86 limitations put on PCs. If the busses are big enough, it would be nice to be able to plug in extra CPUs on a card or suchlike, to upgrade or speed up a system without too much difficulty, as long as the software is not CPU-limited. But as before, it's best not to hold your breath!
Slaimus - Sunday, March 20, 2005 - link
PS1 was easy to program, so that took off. Sony made the PS2 very hard to program if you want to use its vector units efficiently, but since the game developers were already on board, they had to live with it. And Sony will dump the same heap onto developers again with the PS3.
With this kind of complexity, I have a feeling that middleware companies will thrive. Game developers want to create content more than write assembly code, so a few middleware companies will probably supply the libraries while everyone else licenses them. Of course Microsoft has a head start since DirectX already exists and is included in the devkit, but then again, the Xbox 2 is not as massively parallel.
stephenbrooks - Sunday, March 20, 2005 - link
Ah, sod multiple cores. I always preferred playing Tetris anyhow.
knitecrow - Friday, March 18, 2005 - link
GAME DEVELOPER @ GDC RANT ON NEXT GEN CONSOLES
http://www.gamespot.com/news/2005/03/18/news_61204...
All right, here we go. "How Sony and Microsoft are about to screw your game design." These are games in the good old days. We didn't exactly have the best physique, but we were at least a balanced individual, you walk out on the beach, and you were like, you know, pathetic. But you know, you looked like a normal person. These are games today. We've been working really hard--I mean, you can maybe make the argument that this is the game--these are games today. I gotta little more work on that left arm to do, it's going to be as big as our graphics arm soon. This is kind of lame. We really want to be this guy don't we?
Unknown Speaker: No!
[laughter]
Chris Hecker: OK, he was the best guy I could find in like, three seconds in the WiFi network out in the lobby. All right. But how do we get there? Well, I'm going to take a little diversion here. I'm a programmer, so, I have two technical slides, really one technical slide. And that's about it. All right, ready? So there are two kinds of code in a game basically. There's gameplay code and engine code. Engine code, like graphics and physics, takes really giant data structures of homogenous data. I mean, it's all the same, like a lot of vertices are all a big matrix, or whatever, but usually floating point data structures these days. And you have a single, relatively small amount of code that grinds away on that. This code is like, wow, it has a lot of math in it, it has to be optimized for super scalar, blah, blah, blah. It's just not actually that hard to write, right? It's pretty well defined what this code does.
The second kind of code we have is AI and gameplay code. Lots of little exceptions. Even if you're doing a simulation-y kind of game, there's tons of tunable parameters, [it's got a lot of interactions], it's a mess. I mean, this code--you look at the gameplay code in the game, and it's crap. Compared to like, my elegant physics simulator or whatever. But this is the code that actually makes the game feel different. This is the kind of code we want to be easy to write so we can do more experimental stuff. Here is the terrifying realization about the next generation of consoles. I'm about to break about a zillion NDAs, but I didn't sign any NDAs so that's totally cool!
I'm actually a pretty good programmer and mathematician but my real talent is getting people to tell me stuff that they're not supposed to tell me. There we go. Gameplay code will get slower and harder to write on the next generation of consoles. Why is this? Here's our technical slide. Modern CPUs, like the Intel Pentium 4, blah, blah, blah, Pentium [indiscernible] or laptop, whatever is in your desktop, and all the modern PowerPCs, use what's called 'out of order' execution. Basically, out of order execution is there to make really crappy code run fast.
So, they basically--when out of order execution came out on the P6, the Pentium 6 [indiscernible] the Pentium 5, the original Pentium and the one after that. The Pentium Pro I think they called it, it basically annoyed a whole bunch of low level assembly coders, because now all of a sudden, like, the crappiest-ass C code, that like, Joe junior programmer could write, is running as fast as their Assembly, and there's nothing they can do about it. Because the CPU behind their back, is like, reordering that guy's crappy ass C code, to run really well and utilize all the parts of the processor. While this annoyed a whole bunch of people in Scandinavia, it actually…
[laughter]
And this is a great change in the bad old days of 'in order execution,' where you had to be an Assembly language wizard to actually get your CPU to do anything. You were always stalling in the cache, you needed to like--it was crazy. It was a lot of fun to write that code. It wasn't exactly the most productive way of doing experimental programming.
The Xenon and the cell are both in order chips. What does this mean? The reason they did this, is it's cheaper for them to do this. They can drop a lot of core--you know--one out of order core is about the size of three to four in order cores. So, they can make a lot of in order cores and drop them on a chip, and keep the power down, and sell it for cheap--what does this do to our code?
Well, it makes--it's totally fine for grinding like, symmetric algorithms out of floating point numbers, but for lots of 'if' statements and indirections, it totally sucks. How do we quantify 'totally sucks?' "Rumors" which happen to be from people who are actually working on these chips, is that straight line gameplay code runs at 1/3 to 1/10 the speed at the same clock rate on an in order core as an out of order core.
This means that your new fancy 2-plus gigahertz CPU in the Xenon is going to run code as slow as or slower than the 733 megahertz CPU in the Xbox 1. The PS3 will be even worse.
This sucks!
[laughter]
There's absolutely nothing you can do about this. Well, you can actually hope that Nintendo uses an out of order core, because they're claiming that they're going to try and make it easy to develop for--except for Nintendo basically totally flailed this generation. So maybe they'll do something next generation. Who knows? You can think about having batchable design simulation-y systems, but like, I'm a huge proponent of simulation in gameplay, but even simulation in gameplay takes kind of messy systems under the hood. And this makes your gameplay harder to write.
You want to just write the gameplay. You don't want to have to like, spend 6 years of a super hardcore engine programmer's time to figure out how to make your gameplay run super scalars. You could do PC games. They are still out of order cores, but a lot of people don't think that's an option nowadays.
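(His arithmetic is at least self-consistent: at the rumored 1/3 factor, a 2 GHz in-order core behaves like a ~667 MHz out-of-order one, just under the Xbox 1's 733 MHz.) The two kinds of code he describes are easy to caricature in C; a hedged sketch with invented names, showing why one style loves an in-order core and the other doesn't:

#include <stddef.h>

/* "engine code": a long, predictable loop over homogeneous floats.
   No branches to mispredict; an in-order core streams through it. */
void scale_vertices(float *v, size_t n, float s) {
    for (size_t i = 0; i < n; i++)
        v[i] *= s;
}

/* "gameplay code": short, branchy, data-dependent. Every 'if' is a
   potential stall, and an in-order core can't reorder around them. */
typedef struct { int hp, ammo, hostile; } Actor;

int decide(const Actor *a) {
    if (a->hp <= 0)   return 0;                  /* dead: do nothing */
    if (a->hostile)   return a->ammo ? 2 : 3;    /* attack or flee   */
    if (a->hp < 20)   return 3;                  /* wounded: flee    */
    return 1;                                    /* default: patrol  */
}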
tipoo - Thursday, December 3, 2015 - link
It's funny looking back: he wanted them to change the CPU from the Gamecube for the next generation... They ended up using an upclocked Gamecube CPU for the Wii, and a modified tri-core version of it for the Wii U.
Houdani - Friday, March 18, 2005 - link
I think I missed something fundamental.
Can the SPEs be addressed directly by software, or do they have to be fed all of their instructions by the PPE?
If they DO have to be fed by the PPE, I fail to see how the PPE can possibly feed them enough to keep them all working concurrently.
Someone throw me a bone here.
suryad - Friday, March 18, 2005 - link
I thought the G5 was a POWER5 proc. But I could of course be wrong. All I can say is that the Cell, intriguing as it may be, will definitely have a rough road ahead of it, and I am quite surprised that these large corporations invested so much in it, cutting edge though it might be. And as for the foreseeable future, when multi-core FX processors from AMD come out, I do not believe there will be anything more devastating than that. Especially once they hit the 3 GHz barrier with multi-cores enabled and faster DDR2/3 or even RAMBUS memory capabilities.
tipoo - Thursday, December 3, 2015 - link
No, G5 was 970 based.
Questar - Friday, March 18, 2005 - link
#50: Yes, the G5 is a POWER4 derivative.
Since you were wrong on that, don't assume that you know what is significant about the design of the POWER5. There were major architecture changes made to the processor.
fitten - Friday, March 18, 2005 - link
The only things new about Cell are its target market and its being a single chip. The article mentions the TI DSP chip, but there were other similar architectures as well. One example that I'm familiar with is the MAP1310 board by CSPI. Back then, processes weren't good enough to put all the cores on a single chip, but the basic architecture is the same: a PPC core to do the 'normal' stuff and two quad-core DSPs (SHARC) to do the 'work'. This board wasn't successful because it was considered too hard to program to get the performance it promised... and this opinion is from people who live/breathe real-time systems and multiprocessing codes.
The only thing new about Cell is that a) it's all on one chip now and b) the target market is a general marketplace and not a niche.
scrotemaninov - Friday, March 18, 2005 - link
#48. OK, I was under the impression that the G5 was based on the POWER5. You're saying it's based on the POWER4 instead?
And the POWER4 and POWER5 aren't really "completely different chips" in the same way that the P4 and P3 are different chips, or in the way that the P4 and the Opteron are different chips. I can give you a list of the differences if you want. Start at http://www.elet.polimi.it/upload/sami/architetture...
The POWER5 is designed not only to be completely compatible with the POWER4 but also to support all the optimisations from the POWER4. The only things of significance they've done are: a) move the L3 cache controller on-chip; b) change the various branch predictors to bimodal instead of 1-bit; c) increase the associativity and size of the caches.
Anyway, this is going off topic now...
Jacmert - Friday, March 18, 2005 - link
Rofl. Computer engineering and VLSI design. Gotta love those NMOS/PMOS transistor circuits.
I never thought that I'd see stuff from my textbook explained on anandtech.com
saratoga - Friday, March 18, 2005 - link
"#38. You're right that the G5 is a derivative of the POWER5. The POWER5 is dual core, each core with 2way SMT giving a total of 4 'visible' cpus to the OS. The G5 is simply a single core version of the same thing."Err no its not. POWER4 != POWER5. Hence the different names ;)
They're completely different chips.
"Well scrotemaninov I am not disputing that the POWER architecture by IBM is brilliantly done. IBM is definitely one of those companies churning out brilliant and elegant technology always in the background.
But my problem with the POWER technology is from what I understand very limitedly, is that the POWER processors in the Mac machines are a derivative of that architecture right? Why the heck are they so damn slow then?
I mean you can buy an AMD FX 55 based on the crappy legacy x86 arch and it smokes the dual 2.5 GHz Macs easily!! Is it cause of the OS? Because so far from what I have seen, if the Macs are any indication of the performance capabilities of the POWER architecture, the Cell will not be a big hit.
I did read though at www.aceshardware.com benchmark reviews of the POWER5 architecture with some insane number of cores if I recall correctly and the benchmarks were of the charts. They are definitely not what the Macs have installed in them..."
There are slow memory systems, and then there's the one used on the G5. I've heard that you can put 8 Opterons together and still get average access times across all 8 cores that are better than a single G5's. That's probably a good part of the reason the G5 was so much slower than many people thought it would be. The rest is mainly IBM's trouble making them, and their inability to ramp clock speed like they planned on.
scrotemaninov - Friday, March 18, 2005 - link
#38. You're right that the G5 is a derivative of the POWER5. The POWER5 is dual core, each core with 2way SMT giving a total of 4 'visible' cpus to the OS. The G5 is simply a single core version of the same thing.
As for the performance, Opteron is pretty much unbeatable for integer-bound applications. Itanium2 is unbeatable for FP applications. POWER5 is somewhere in the middle.
Most desktop applications are going to be integer bound. So it's not at all surprising that you find the G5 'slow' in that respect in comparison to the FX55. Plus, and this is the whole problem with the CELL, there's no point putting dual CPUs in there unless you can utilise them properly. If you have one process going flat out trying to run a heavy application and it's single threaded then you're only using about 1/4 of the CPUs you've bought for that application (for a dual G5 2.5), whereas the Opterons and FX55 stuff is more designed around quick, single threaded applications.
dmens - Friday, March 18, 2005 - link
Pseudo-PMOS? wtf? That's domino logic; it's been around forever, and it's definitely not efficient in terms of power. Oh, and it takes forever to verify timing.
Poser - Thursday, March 17, 2005 - link
There were moments while reading this article that I expected there to be a "Test Yourself" quiz at the end of the chapter ... er, article. Which isn't to say that articles like this are too textbookish; it's to say that they're wonderfully educational. And very, very cool for being so.
I'm half joking when I say this (but only half) -- a real "test" at the end of the article would be fun. I could see if I really understood what I read, and even get to compare my score to the rest of the, uhm, class.
drinkmorejava - Thursday, March 17, 2005 - link
Very nice. How long did it take to write that thing?
Eug - Thursday, March 17, 2005 - link
#42,
That's an interesting page, cuz everyone on OS X already knows that Word is slow on the Mac. It brings us back to the original statement that some ported software may be problematic performance-wise.
And the generic comment on the Mac side about Premiere is, well... use Final Cut Pro. :) Here is a test that seems a bit more useful, since it tests Cinema4D and After Effects, two apps that people use on the Mac and both of which are reasonably well optimized:
http://digitalvideoediting.com/articles/viewarticl...
That's a good point about the memory scaling though. The IMC with AMD's chips is a definite advantage. I'm sure the G5 970MP dual-core won't get an IMC either.
Anyways, as far as this article is concerned, the G5 is kinda irrelevant. The interesting part for Apple in Cell is the PPE unit. It's also interesting that Anand says the original SPE was supposed to be VMX/Altivec. But the current SPE is not Altivec so it's less applicable for Apple, at least in the near term.
It would be interesting to know how fast a dual-core 3 GHz PPE would be in general laptop-type code, and how much power it would put out.
MDme - Thursday, March 17, 2005 - link
#39, 40, 41
http://www.pcworld.com/news/article/0,aid,112749,p...
Remember that the Athlon 64 chips scale better at higher clock speeds due to the memory controller scaling as well.
Eug - Thursday, March 17, 2005 - link
Well, one example is Cinebench 2003:
The dual G5 2.0 GHz is about the same speed as a dual Opteron 246 2.0 GHz, with a score at around 500ish.
http://www.aceshardware.com/read.jsp?id=60000284
BTW, a dual G5 2.5 GHz scores 633.
suryad - Thursday, March 17, 2005 - link
Hmm, that is interesting, what you say, Eug. I see your point. Do you have any links to straight comparos between an FX and a top-of-the-line Mac? Or from personal experience, folding and such...
Eug - Thursday, March 17, 2005 - link
#38. It's a mistake to say an AMD FX 55 smokes a dual G5 2.5. For instance, if you like scientific dual-threaded stuff, the G5 does very well. However, the AMD FX 55 IS faster than a single G5 2.5. It's got a slight edge clock-for-clock, and it's clocked slightly higher too.
The real problem is when you have stuff built for x86 ported over to PPC. It just isn't great on the Mac side performance-wise in that situation. And Macs aren't tweaked for gaming either. The AMD is going to smoke the Mac in Doom 3 of course.
I think with the performance advantage of the Opteron, I'd put a single G5 2.5 in the range of performance of a single Opteron 2.2-2.4 GHz, depending on the app. The really interesting part, though, will be the coming quarter, when the new G5s are released. They should get a significant clock speed bump (20%?), and information on dual-core G5s is already out there (as with AMD and their dual-core Athlons). They also get a cache boost: right now they only have 512 KB, but are expected to get 1 MB L2.
suryad - Thursday, March 17, 2005 - link
Well scrotemaninov, I am not disputing that the POWER architecture by IBM is brilliantly done. IBM is definitely one of those companies churning out brilliant and elegant technology, always in the background.
But my problem with the POWER technology, from what I understand very limitedly, is that the POWER processors in the Mac machines are a derivative of that architecture, right? Why the heck are they so damn slow then?
I mean, you can buy an AMD FX 55 based on the crappy legacy x86 arch and it smokes the dual 2.5 GHz Macs easily!! Is it because of the OS? Because so far, from what I have seen, if the Macs are any indication of the performance capabilities of the POWER architecture, the Cell will not be a big hit.
I did read at www.aceshardware.com benchmark reviews of the POWER5 architecture with some insane number of cores, if I recall correctly, and the benchmarks were off the charts. They are definitely not what the Macs have installed in them...
scrotemaninov - Thursday, March 17, 2005 - link
#35: different approaches to solving the same problem.
Intel came up with x86 a long time ago and it's complete rubbish, but they maintain it for backwards compatibility (here's an argument for Open Source Software if ever there was one...). They have huge amounts of logic to effectively translate x86 into RISC instructions - look at the L1I Trace Cache in the P4, for example.
IBM aren't bound by the same constraints - their PowerPC ISA is really quite nice, and so there's nowhere near the same amount of pain suffered trying to deal with the same problem. It does seem, however, that IBM are almost at the point that Intel want to be at in 10 years' time...
Verdant - Thursday, March 17, 2005 - link
Here is a question... The article mentions (or alludes to the idea) that having no cache means it is possible to know exactly when an instruction will be executed. Is the memory interface therefore a strict "real time system"?
WishIKnewComputers - Thursday, March 17, 2005 - link
Well, I don't really see the Cell 'breaking' in any way. Between being in the PS3, IBM servers/supercomputers, and Sony and Toshiba electronics, the chip will be all over the place.
As for it showing up in PCs... no, it won't happen anytime soon, but I really don't think it's intended to at this point. Workstations and playstations are its main concern, and smartly so. The Cell in its first generation isn't cut out for superior general tasking, obviously, but when those things start pumping out (and they will... the PS2 has sold what, 80 million units?), there will likely be different and more advanced versions. And if some of those are changed for enhanced general purposing somehow or another, then they could have a shot at entering the PC world. As for taking on Intel, though... I don't think IBM is even considering that. If I had to guess: if they wanted to be in a PC, they would have OS X adapted to Cell, and IBM would have these things in Apples.
But no matter which way they go, is it me or does IBM seem light-years ahead of Intel? After looking at Intel's future plans, it seems that they are trying to move towards what IBM is doing now. So is the Cell a processor just ahead of its time, or has Intel just gotten behind?
AnnihilatorX - Thursday, March 17, 2005 - link
This article is seriously a killer for a child like me. I appreciate it though. Well done Anandtech.
ravedave - Thursday, March 17, 2005 - link
I can't wait to see what developers think of the Cell and the SDKs for it. I have a feeling that's what will kill the Cell or make it successful.
microbrew - Thursday, March 17, 2005 - link
"System on a Chip (SoC)"What will make or break the Cell is the tools available, especially the operating system and libraries.
I would like to see what they're doing in terms of marketing the chip for consumer electronics, telecom, military and other embedded applications. I could see the Cell as a viable alternative to the usual mixtures of PowerPCs, ARMs and DSPs.
I also agree with the Final Words; I don't see the Cell breaking into the consumer PC market any time soon either.
Locut0s - Thursday, March 17, 2005 - link
#17 Yeah, that was a bit too harsh, I agree.
Eug - Thursday, March 17, 2005 - link
I'm just wondering how well a dual-core PPE-based 4+ GHz chip would do on general-purpose (desktop) code. And I also wonder how cool/hot such a chip would run. The Xbox 2's CPU is probably a 3-core PPE, but it runs at 3 GHz, and we don't have power specs for it anyway.
Filibuster - Thursday, March 17, 2005 - link
#11 (well, everyone should if they haven't before): read the Ars Technica article on PS2 vs PC - static applications vs dynamic media. Cell is taking it to the next level.
http://arstechnica.com/articles/paedia/cpu/ps2vspc...
Very nice article Anand!
Googer - Thursday, March 17, 2005 - link
Besides a release date, is there any news or knowledge of a Linux kit for the PlayStation 3 like there was for the PS2? Does anyone KNOW OF either?
Illissius - Thursday, March 17, 2005 - link
Damn. Awesome article. If I hadn't known the site and author beforehand, I would've guessed Ars and Hannibal. Seems he isn't the only one with a talent for these kinds of articles ;) You should do more of them.
scrotemaninov - Thursday, March 17, 2005 - link
#22: This is just a guess, so don't rely on it. The POWER5 has 2-way SMT. Each cycle it fetches 8 instructions from the L1I cache. All instructions fetched in a given cycle are for the same thread, so it alternates (round robin). It also has capabilities for setting thread priority, so you can effectively run with 1 thread and it just fetches 8 instructions per cycle for the one running thread. I would expect the PPE to be similar to this, fetching 2 instructions for the same thread each cycle. The POWER5 has load-balancing stuff in there too - if one thread keeps missing in L2, the other thread gets more instructions decoded in order to keep the CPU's functional unit utilisation up. I've no idea whether this kind of stuff has made it over into the PPE; I'd be a little surprised if it has, especially seeing as this is in-order anyway, so it's not like you're going to be aiming for high utilisation rates.
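Roughly, the fetch policy I'm describing looks like this toy model - a sketch of round-robin fetch with a stall-aware twist, purely illustrative and not IBM's actual logic:

    /* Toy model of round-robin SMT fetch, loosely based on the POWER5
     * behaviour described above. Hypothetical simplification: one thread
     * is selected per cycle and FETCH_WIDTH instructions are fetched for
     * it; a stalled thread forfeits its slot to the other thread. */
    #include <stdbool.h>
    #include <stdio.h>

    #define FETCH_WIDTH 8   /* POWER5 fetches 8 instructions per cycle */
    #define NUM_THREADS 2   /* 2-way SMT */

    typedef struct {
        bool stalled;       /* e.g. waiting on an L2 miss */
        unsigned fetched;   /* instructions fetched so far */
    } thread_t;

    int main(void) {
        thread_t threads[NUM_THREADS] = {{false, 0}, {false, 0}};
        unsigned next = 0;  /* round-robin pointer */

        for (unsigned cycle = 0; cycle < 8; cycle++) {
            /* Pick the next ready thread, skipping stalled ones so the
             * other thread soaks up the fetch bandwidth. */
            unsigned t = next;
            for (unsigned i = 0; i < NUM_THREADS; i++) {
                unsigned cand = (next + i) % NUM_THREADS;
                if (!threads[cand].stalled) { t = cand; break; }
            }
            threads[t].fetched += FETCH_WIDTH;
            printf("cycle %u: fetch %d instructions for thread %u\n",
                   cycle, FETCH_WIDTH, t);
            next = (t + 1) % NUM_THREADS;

            /* Pretend thread 1 stalls on a cache miss halfway through,
             * so thread 0 gets all the fetch slots from then on. */
            threads[1].stalled = (cycle >= 3);
        }
        return 0;
    }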
scrotemaninov - Thursday, March 17, 2005 - link
#23: True, but I believe that when the SPEs access outside memory they go through the cache. Sure, it's a looser coherency than we're used to, but it's not much worse.
Houdani - Thursday, March 17, 2005 - link
18: Top drawer post.
20: Thanks for the links!
fitten - Thursday, March 17, 2005 - link
"Given the speed of the interconnect and the fact that it is cache-coherant,"Only the PPC core has cache. The individual SPEs don't have cache - they have scratchpad RAM.
#22: I believe the PPC core is a dual-issue core that just happens to be 2xSMT.
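To make the scratchpad point concrete: on an SPE nothing is fetched for you, so code has to DMA its data into the local store explicitly before touching it. A minimal sketch, assuming the MFC intrinsics from spu_mfcio.h in IBM's Cell SDK; the function and the chunk size here are invented for illustration:

    /* What "no cache, scratchpad RAM" means in practice: data must be
     * pulled into the 256KB local store by explicit DMA before use. */
    #include <spu_mfcio.h>

    #define CHUNK 4096
    static volatile float buf[CHUNK / sizeof(float)]
            __attribute__((aligned(128)));  /* DMA wants 128-byte alignment */

    float sum_chunk(unsigned long long ea /* effective address in main memory */) {
        const unsigned tag = 0;

        /* Kick off the DMA from main memory into local store... */
        mfc_get(buf, ea, CHUNK, tag, 0, 0);

        /* ...and block until it completes. A real kernel would
         * double-buffer so compute overlaps the next transfer. */
        mfc_write_tag_mask(1 << tag);
        mfc_read_tag_status_all();

        float sum = 0.0f;
        for (unsigned i = 0; i < CHUNK / sizeof(float); i++)
            sum += buf[i];
        return sum;
    }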
AndyKH - Thursday, March 17, 2005 - link
Great article. Anand, could you please clarify something:
I had the impression that the PPE was an SMT processor in the sense that it had to be executing 2 threads in order to issue 2 instructions per clock. In other words: I didn't think the PPE control logic could decide to issue 2 instructions from the same thread at any given clock tick, but rather that it absolutely needed an instruction from each thread to issue two instructions.
After reading the article, I no longer assume my impression is right, but a comment from you would be nice.
As I come to think about it, my impression describes something rather identical to two separate single-threaded in-order cores. :-)
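For what it's worth, the difference between the two readings can be shown with a toy issue-stage model - purely hypothetical, as neither policy is confirmed here as the PPE's actual behaviour:

    /* Policy A needs one instruction from each thread to dual-issue
     * (behaves like two 1-wide cores); Policy B may fill both slots
     * from a single thread when the other has nothing ready. */
    #include <stdio.h>

    typedef struct { int ready; } thread_q_t; /* instructions ready to issue */

    static int issue_lockstep(thread_q_t *t0, thread_q_t *t1) {
        /* Policy A: strict interleave - a slot is wasted whenever
         * either thread has nothing ready. */
        int issued = 0;
        if (t0->ready && t1->ready) { t0->ready--; t1->ready--; issued = 2; }
        else if (t0->ready)         { t0->ready--; issued = 1; }
        else if (t1->ready)         { t1->ready--; issued = 1; }
        return issued;
    }

    static int issue_flexible(thread_q_t *t0, thread_q_t *t1) {
        /* Policy B: fill both slots from whatever is ready, even if
         * that means two instructions from the same thread. */
        int issued = 0;
        for (int slot = 0; slot < 2; slot++) {
            if (t0->ready >= t1->ready && t0->ready) { t0->ready--; issued++; }
            else if (t1->ready)                      { t1->ready--; issued++; }
        }
        return issued;
    }

    int main(void) {
        thread_q_t a = {4}, b = {0};  /* thread b is stalled */
        printf("lockstep: %d issued\n", issue_lockstep(&a, &b)); /* prints 1 */
        a.ready = 4; b.ready = 0;
        printf("flexible: %d issued\n", issue_flexible(&a, &b)); /* prints 2 */
        return 0;
    }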
Koing - Thursday, March 17, 2005 - link
Cell looks VERY interesting. Any of you guys seen Devil May Cry 3 on the PS2? Looks great IMO, same with T5 and GT4.
Cell at first will be tough to develop for, like most consoles, BUT eventually the developers will get around it and make some very solid, good-looking games.
Let's hope they are innovative and not just rehashed graphics and nothing else.
Thanks for the great article.
Koing
scrotemaninov - Thursday, March 17, 2005 - link
I really hate just dumping loads of links, but this is basically the available content on the CELL.
http://arstechnica.com/articles/paedia/cpu/cell-1....
http://arstechnica.com/articles/paedia/cpu/cell-2....
http://realworldtech.com/page.cfm?ArticleID=RWT021...
http://www.blachford.info/computer/Cells/Cell0.htm...
http://www.realworldtech.com/page.cfm?ArticleID=RW...
http://www.hpcaconf.org/hpca11/papers/25_hofstee-c...
http://www.hpcaconf.org/hpca11/slides/Cell_Public_... (slides)
mrmorris - Thursday, March 17, 2005 - link
Brilliant article. There are few places for in-depth, hardcore technology presentations, but Anandtech never fails.
scrotemaninov - Thursday, March 17, 2005 - link
Real concurrency is hard for programmers to do. It's a real pain to get right and it's hard to debug. Systematic analysis just gets too complex, as there are too many states; you end up with a huge graph/Markov model and it's just impossible to solve tractably.

Superscalar and SMT just try to increase ILP at the CPU level without burdening the programmer or compiler writer. However, we've pretty much come to the end of getting a CPU to go faster - at 5GHz, light travels 6cm between clocks, and an electrical signal will propagate slower than that. As it is, in the P4 pipeline there are at least 2 stages which are simply there to allow signals to propagate across the chip. Clearly, going faster in Hz isn't going to make the pipeline go faster.
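(Sanity check on the 6cm figure - it's just the speed of light over one clock period at 5GHz:

\[ d = \frac{c}{f} = \frac{3\times10^{8}\ \mathrm{m/s}}{5\times10^{9}\ \mathrm{Hz}} = 0.06\ \mathrm{m} = 6\ \mathrm{cm} \]

so the usable distance per cycle in real on-chip wiring is a fraction of that.)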
So the ONLY thing they can do now is put lots of cores on the same chip, and then we're going to have to deal with real concurrency. IBM/Sony are doing it now with CELL, and Intel will do it in a few years. It's going to happen regardless. What we need is languages which can support real concurrency. The Java Memory Model is an almost ideal fit for the CELL, but other aspects don't work out so well, maybe. We need Pi-calculus/Join-calculus constructs in languages to be able to really deal with these CPUs efficiently.
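The kind of construct I mean can be hand-rolled even in plain C with pthreads - a blocking one-slot channel, which is roughly the Pi-calculus communication primitive. A sketch; chan_t and friends are names made up for this example:

    #include <pthread.h>
    #include <stdio.h>

    typedef struct {
        pthread_mutex_t mu;
        pthread_cond_t  cv;
        int value;
        int full;            /* one-slot channel: 0 = empty, 1 = full */
    } chan_t;

    static void chan_send(chan_t *c, int v) {
        pthread_mutex_lock(&c->mu);
        while (c->full) pthread_cond_wait(&c->cv, &c->mu); /* wait for space */
        c->value = v;
        c->full = 1;
        pthread_cond_broadcast(&c->cv);
        pthread_mutex_unlock(&c->mu);
    }

    static int chan_recv(chan_t *c) {
        pthread_mutex_lock(&c->mu);
        while (!c->full) pthread_cond_wait(&c->cv, &c->mu); /* wait for a value */
        int v = c->value;
        c->full = 0;
        pthread_cond_broadcast(&c->cv);
        pthread_mutex_unlock(&c->mu);
        return v;
    }

    static chan_t ch = { PTHREAD_MUTEX_INITIALIZER, PTHREAD_COND_INITIALIZER, 0, 0 };

    static void *producer(void *arg) {
        (void)arg;
        for (int i = 0; i < 4; i++) chan_send(&ch, i * i);
        chan_send(&ch, -1);                                /* sentinel: done */
        return NULL;
    }

    int main(void) {
        pthread_t t;
        pthread_create(&t, NULL, producer, NULL);
        for (int v; (v = chan_recv(&ch)) != -1; )
            printf("got %d\n", v);
        pthread_join(t, NULL);
        return 0;
    }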
Your comments about CELL not being general-purpose enough are a little wrong. IBM /already/ has the CELL in workstations and is evaluating applications that will work well. Given the speed of the interconnect and the fact that it is cache-coherent, I think we'll be seeing supercomputers based on many CELLs; it's an almost ideal fit (as it is, you've almost got ccNUMA on a single chip). Also, bear in mind that this is IBM's 5th (or 6th?) generation of SMT in the PPE - they've been at it MUCH longer than Intel. IBM started in the mid-90s, around the same time the Alpha crew were working on the EV8, which was going to have 8-way thread-level parallelism (got canned, sadly).
Also, if you look at IBM's heavy CPUs - the POWER5 - that has SMT and dispatches in groups of 8 instructions, not the 3/4 that AMD/Intel manage.
What I'm saying here is that, sure, the SPEs don't have BPTs or BTBs, and they're all 2-way dispatch and not greater; but they all run REALLY fast, they have short pipelines (so the pain of a branch misprediction won't be so bad), and IBM have had software branch prediction available since the POWER4, so they've been at it a few years and must have decided that compilers really can successfully predict branch directions.
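Software branch prediction just means the compiler or programmer marks the likely direction statically, so the hardware doesn't need to guess. A minimal example using GCC's real __builtin_expect; the surrounding function is invented for illustration:

    #include <stddef.h>

    #define likely(x)   __builtin_expect(!!(x), 1)
    #define unlikely(x) __builtin_expect(!!(x), 0)

    long sum_valid(const long *v, size_t n) {
        long sum = 0;
        for (size_t i = 0; i < n; i++) {
            /* Tell the compiler the error path is cold; it lays out
             * the hot path as the fall-through so fetch rarely
             * redirects even without a hardware predictor. */
            if (unlikely(v[i] < 0))
                continue;   /* rare bad sample */
            sum += v[i];
        }
        return sum;
    }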
Backwards compatibility doesn't matter. Sure, Microsoft took several years to support AMD64, but that didn't stop take-up of the platform - everyone just ran Linux on it (well, everyone who wanted to use the 64-bit CPU they'd bought). We'll only have to wait a few months after the CELL is out before Linux can be built on it. 100 quid says Microsoft will never support it.
Frankly, considering that it's far more likely to go into supercomputer or workstation environments, no one there gives a damn about backwards compatibility or Windows support. No one in those environments /wants/ a damn paper clip.
Reflex - Thursday, March 17, 2005 - link
#14: Replace 'lazy developers' with 'developers on a budget' and you will have a true statement. It's not an issue of laziness, it's an issue of having the budget to optimize fully for a platform.
GhandiInstinct - Thursday, March 17, 2005 - link
Wow, Super CPU and SUPER RAMBUS? AHHHH! This will replace my computer. PS3, that is.
ceefka - Thursday, March 17, 2005 - link
Rambus' Revenge
Locut0s - Thursday, March 17, 2005 - link
Great article Anand!! Yeah, I actually get to bring my Comp 150 knowledge to bear in reading this article! If this had come out 6 months ago I would have been totally lost. It will indeed be interesting to see what headway Cell can make; however, unfortunately, as Anand alludes to, the x86 architecture is just too heavily entrenched for anything to budge it except the Big 2 (AMD and Intel). I can't wait to see what type of power the PlayStation 3 will have though, and especially how that power will be utilized in games. I bet there will be some jaw-dropping graphics awaiting us there. That is, if Cell's limitations don't hold back lazy game developers and lead to a string of mediocre games punctuated by a few amazing titles made by independent developers who really care to utilize the architecture. Didn't the PlayStation 1 suffer something similar?
knitecrow - Thursday, March 17, 2005 - link
The Real World Technologies article on the Cell states that it gives up single-thread performance in favour of running many parallel threads. That sounds like a terribly difficult processor to develop games for. I for one think it will be easier to put the burden on the hardware rather than on the software side.
Can we see another repeat of PS2? Technically impressive, but hard to code for.
JarredWalton - Thursday, March 17, 2005 - link
11 - I think the point is that games tend to use certain functions of a CPU much more frequently, while general business/office applications make use of a wider range of generic operations. I understand your complaint, as office applications generally don't need a lot more power than about 1.5 GHz at most. However, the key part of the statement was "general purpose microprocessor" and not "very powerful".
AnandThenMan - Thursday, March 17, 2005 - link
WAIT. What the flock does this mean?
"Performance in business/office applications requires a very powerful, very fast general purpose microprocessor, but performance in a game console, for example, does not."
WHAT??????? Hello?? So an office app like Word needs a very powerful processor, but a game console does not? I beg to differ. I suppose it depends on how you define "business/office application", but I think that statement is WAY off. I know several current office applications that will limp along on a Pentium 133, but no current game has any hope on the same CPU.
tipoo - Wednesday, July 30, 2014 - link
It was clear to me that it meant console CPUs don't have to be as general-purpose and brute-force powerful in every regard - they can get away with being more specialized, and suck at general work, but still be fast for game-specific code.
Googer - Thursday, March 17, 2005 - link
When are they coming out? Anyone know of a release date?jeffbui - Thursday, March 17, 2005 - link
#4, I do. Heh. I've been waiting for this article forever... thanks!
JarredWalton - Thursday, March 17, 2005 - link
Interesting stuff. The PlayStation has always been something of a pain in the rear to program. PS1 went its own way, and PS2 did the same. PS3 and Cell seem ready to pave new roads into the "OMG this is really complex" land of programming. I'm glad I've given up serious programming... :)
Googer - Thursday, March 17, 2005 - link
In soviet russia cell processor controls your mind.
faboloso112 - Thursday, March 17, 2005 - link
Ahh, I love bedtime stories! Great read... VERY informative!
ksherman - Thursday, March 17, 2005 - link
Sweet article! Way over my head, but there were some parts that were dropped down to my level of understanding. Leave it to Anand to tell the real story. It will be interesting to see how willing some companies will be to accommodate Sony's radical processor... but as long as there's money... Do you think it is possible to (down the road) drop an x86 chip in place of the PPE? Wouldn't that make the Cell compatible with current processing standards?
ProviaFan - Thursday, March 17, 2005 - link
Describing this as a "sit down read" type of article makes me want to print it out to put it in the magazine rack, because I don't have a laptop + 802.11g to peruse AnandTech while I'm, er... ;)xsilver - Thursday, March 17, 2005 - link
Nice, definitely one of those "sit down reads"... some serious shiznit ;)
OMG! FIRST POST LOL ROFL LMAO OMG!!! LOOK WHOS COOL!!!
Fricardo - Thursday, March 17, 2005 - link
Finally! Thanks guys.
Bawl - Saturday, January 25, 2014 - link
I just love this deep analysis of one of the most misunderstood processors of the last decade. Too bad that after spending more than half a billion dollars, Sony/Toshiba/IBM didn't release the presumably outstanding Cell 2.
Ferrx - Sunday, December 20, 2015 - link
Hi, can you help me to understand this? I don't understand these at all.

      Decode           Execute             Write
    | I1 | I2 |   |    |    |    |   |    |    |
    | I3 | I4 |   | I1 | I2 |    |   |    |    |
    | I3 | I4 |   | I1 |    |    |   | I2 |    |
    |    | I4 |   |    |    |    |   | I1 | I3 |
    | I5 | I6 |   |    |    | I4 |   | I4 |    |
    |    | I6 |   |    | I5 |    |   | I5 |    |
    |    |    |   |    | I6 |    |   | I6 |    |
In "Decode", each row has 2 columns. What do First and Second Column mean ?
same as "Write"
And in "Execute, each row has 3 columns. What do First, Second and Third column mean ?
And how is the process ? (The current table is about "In-Order Issue with Out-of-Order Completion").
I've read it many times, in the "Instruction Level Parallelism". But I still don't have any idea about it.
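The natural reading (not confirmed by the article's author): the two Decode columns are the two instructions that can be decoded per cycle, the three Execute columns are three execution units, the two Write columns are the two results that can be written back per cycle, and each row is one clock cycle. A toy C model of in-order issue with out-of-order completion may help; the instruction latencies are invented:

    /* Instructions enter execute in program order, at most two per
     * cycle, but each occupies an execute unit for its own latency and
     * writes back whenever it finishes - so a short op issued after a
     * long one can complete first (e.g. I2 before I1 here). */
    #include <stdio.h>

    #define NUM_INSNS 6
    #define NUM_UNITS 3
    #define ISSUE_WIDTH 2

    int main(void) {
        int latency[NUM_INSNS] = {3, 1, 2, 1, 2, 1};  /* made-up latencies */
        int unit_free_at[NUM_UNITS] = {0, 0, 0};
        int next = 0;                 /* next instruction to issue, in order */

        for (int cycle = 1; next < NUM_INSNS; cycle++) {
            int issued = 0;
            while (next < NUM_INSNS && issued < ISSUE_WIDTH) {
                /* find an execute unit that is free this cycle */
                int u = -1;
                for (int i = 0; i < NUM_UNITS; i++)
                    if (unit_free_at[i] <= cycle) { u = i; break; }
                if (u < 0) break;     /* structural hazard: stall in decode */
                unit_free_at[u] = cycle + latency[next];
                printf("cycle %d: issue I%d to unit %d (writes back cycle %d)\n",
                       cycle, next + 1, u, cycle + latency[next]);
                next++;
                issued++;
            }
        }
        return 0;
    }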
Ferrx - Sunday, December 20, 2015 - link
Aww... Can't do tab-'ing' 0__0