"When a thread is blocked it got swapped out of the processor all together. It is the OS's job to check if some conditions are met to re-waken a thread. So a waiting thread will not be actively checking that data at any time.
Only in single-write/multi-read situation (server/consumer model) those consumer threads are not blocked but actively checking for new data."
Only if you are using synchronization primitives (mutex, critical section, semaphore, etc.) which are kernel objects or you call sleep() or something in the midst of reading/writing values. If you are just reading/writing a memory location, the OS doesn't know anything about it. Plus, if you have multiple CPUs/cores, more than one thread can be running simultaneously, which is where the MOESI protocols really come into play.
When a thread is blocked it got swapped out of the processor all together. It is the OS's job to check if some conditions are met to re-waken a thread. So a waiting thread will not be actively checking that data at any time.
Only in single-write/multi-read situation (server/consumer model) those consumer threads are not blocked but actively checking for new data.
"When you write a program where the threads are effectively fighting over the ownership of data, particularly in the current designs of multiprocessor (this includes multi-core) cache systems, performance will tank because of all the overhead of taking ownership and such"
But doesn't AMDs MOESI protocol help avoid this by allowing one cache to copy data from another?"
No, MOESI doesn't help avoid the problem - It is the mechanism of how the problem is arbitrated and resolved.
Simplified example: CPU1 wants some data. The cache subsystem uses MOESI to determine that CPU0 currently owns that data. MOESI protocols are then used to transfer the ownership of that data to CPU1 (including copying the data to a different cache if necessary). Meanwhile, one (definitely the writing core) or both cores must wait while the MOESI stuff is done and then CPU1 is allowed to proceed with its write.
So, you can write a two thread program where each thread does nothing but writes a value into a memory location (both threads write to the same memory location). That cannot be avoided by anything. On every write, MOESI will be invoked to resolve the ownership of the data and make sure the processor currently wanting to write to that memory location owns it. So, these two threads will generate massive amounts of MOESI traffic between the two caches (on a multi-core or multi-processor machine) because both cores want to effectively always own that memory. While MOESI is fast, it still takes time to resolve, longer than not having to do the transfer of ownership and any copying required in any case. So, you have two cores fighting over the data and generating a lot of MOESI overhead which saps performance from both cores (both cores spend a bit of time waiting until the cache tells it that it can do its writing).
"I agree fully that most multi threaded applications are coarse grained. But there are HPC applications where you can not avoid to work on shared data. I believe fluid dynamics, and OLTP applications that mix writes with reads (and use row locking) are examples."
Absolutely. There are times when it simply cannot be avoided and must be done. But, if you can avoid it, then you probably want to avoid it :)
Good summary, that is most likely what is happening at Intel.
bob661:
"The Quest for More Processing Power, Part Three: ", that doesn't sound like a buyers guide hey? :-)
nserra:
Very astute! Ok, ok, "AMDs current dual core architecture is pretty good, let’s wait Until Intel gets it right :-).
Fitten:
I agree fully that most multi threaded applications are coarse grained. But there are HPC applications where you can not avoid to work on shared data. I believe fluid dynamics, and OLTP applications that mix writes with reads (and use row locking) are examples.
"When you write a program where the threads are effectively fighting over the ownership of data, particularly in the current designs of multiprocessor (this includes multi-core) cache systems, performance will tank because of all the overhead of taking ownership and such"
But doesn't AMDs MOESI protocol help avoid this by allowing one cache to copy data from another?
Processes that will benefit from fast cache-cache transfers are ones that are multithreaded and the threads are manipulating the same data. There are applications that do this, but usually when you design multi-threaded applications you try to avoid these type situations. When you write a program where the threads are effectively fighting over the ownership of data, particularly in the current designs of multiprocessor (this includes multi-core) cache systems, performance will tank because of all the overhead of taking ownership and such. Shared (L2) caches tend to help this out because the data doesn't actually have to be transfered to the other core's cache as a part of the taking of ownership, the cache line(s) can stay right where they are with only the ownership modified.
Anyway, HPC code usually goes through pains to avoid the situation where ownership of data must switch between processes/threads often. That's why data partitioning is one of the most important steps of application design in parallel applications.
"In Part 2, Tim Sweeney, the leading developer behind the Unreal 3 engine, explained the challenges of multi-threaded development of the next generation of games."
...before showing off a beautiful working demo of the Unreal 3 engine on the 7-core PS3 cell processor that was put together in only 2 months and that was relatively easy to develop according to the Unreal guys themselves... Ha! (cos Sweeney did downplay the use of multithreading in games if you read his original comments)
It is an interesting read I would say. But I would like to point out that OLTP programs will not benefit from cache2cache performance very much. That is because the very principle of multi-threaded programming requires the user account to be locked before updating. So only one thread can update an user account at any given time and other threads are blocked. Only programs that use data in single-write and multi-read form will benefit from cache2cache performance. And most likely these applications will be some sort of scientific simulations.
"AMDs current dual core architecture is vastly superior to Intels"
This is wrong!!! You said your self that Intel "new" processor was more of a “special” packing than a dual core processor, so you should say is:
"AMDs current dual core architecture is amazing let’s wait what intel will do at a latter time"
TDP is for power consuming as a 500W PSU is at it. Just because you have 500W PSU doesn’t mean it draws 500W of power.
New Venice core as more transistors than the previous core not just because of SSE3, there is new power stages than can be enable to further lower power consuming, I doubt that putting a Turion on a regular board will enable those new power stages.
"the Pentium M 2.0 GHz chips manage to run at 22W"
To be specific, they have a TDP of 22 watts which isn't really the same thing...
"as I understand it even under maximum load the Pentium M stays under 22W, right?"
Not at all...in fact it can be significantly higher than that. Intel's TDP measures an average usage under load rather than peak, while AMD's measures absolute theoretical peak under the worst conditions. This is why the TDP is quite meaningless...
I guess my point is that I am of the opinion that the Turion might actually run at significantly lower power usages. As absolutely nobody (that I am aware of) has tested beyond the system level (i.e. the chip itself), I can't be sure...but judging by the actual specs of the chips themselves (not the TDP, but the electrical specifications) it appears that the PM may indeed be higher.
I know I've asked before, but with the power usage and heat becoming more and more important, couldn't you guys develop a test of the actual realworld usage of the chips themselves?
I think it might be quite illuminating...
@Questar:
Criticizing Intel and saying good things about AMD and IBM means, that Johan is an AMD fanboy? I think not. You'll see, that the opinions about Whitefield, Merom, Yonah & Co. after hitting the public will be better than about Smithfield now. That's simply the result of the amount of effort put into the designs. A dedicated dual core design is not the same as an on die dual Xeon system.
@photoguy99:
I'd say, Johan can make this conclusion, because he has the knowledge to do so. I'd come to the same conclusion, since the Windows scheduler (at least for XP) is not so much core-aware. It just sees the logical or phyisical CPUs and if one becomes free, it just sends the next thread to it. This causes thread-hopping (can be seen in the tech-report dual core reviews thanks to task manager screenshots). In such cases it matters somewhat if the last used data is in the other L2 cache and can be quickly transferred to the current L2 cache. And it matters for multithreaded applications, which work on the same set of data.
@mazuz:
I'd suggest to look at benchmarks of a 275 vs. dual 248 with 1 dual channel memory bank and benchmarks of a dual Xeon with FSB800 and a similarly configured (cache, FSB, memory, HT) Smithfield. That's the difference caused by the SRQ-connection.
@Ahkorishaan:
The mentioned upcoming Intel cores will indeed be nice. But some people here and on many other forums sound like the dual core K8 was AMD's last CPU and the K8 their last core ever. :) However, have a look at AMD's patent portfolio and you'll see, that this is not the case. As Fred Weber said, AMD is also still looking at power consumption. This is maybe the reason, why we might see a future CPU with more cores, but less FPU power per core (due to shared FPUs).
AMD is also working on using things like clock gating and throttling (used by P-M) to further reduce power consumption. Currently they only implemented some standard features to keep power consumption down like other transistor designs (especially slower transistors in not so critical places), microarchitectural changes (better HALT mode), C3 state and PowerNow!/C'n'Q.
Viditor, I think the point is that the Pentium M 2.0 GHz chips manage to run at 22W - still less than 1/3 of what the Winchester and Venice cores put out, I think. What exactly did they do to get that low? Well, there's gating technology for sure - i.e. power down unused portions of the chip - but as I understand it even under maximum load the Pentium M stays under 22W, right?
Maybe Johan has more specifics, but I don't. I just know the price for power use on the design is very impressive, and I was surprised some of the same tech wasn't used in Prescott.
Your usual excellent work Johan, thanks.
A couple of nits to pick...
"Intel will use its P-m “know-how” to keep the power dissipation so low"
If you could qualify exactly what "know-how" you mean, that would be appreciated. IMHO, a major reason that PM is able to stay so much cooler that the Netburst chips (and on par with the Athlons) is that it doesn't have nearly as many features... Is there a reason you see the PM translating well into full blown server and desktop chips?
"Intel can leverage their experience with the power saving features of the P-m to design quad core CPUs with remarkably low TDP"
Arrrrrggghhh! This is a pet peave of mine. TDP IS NOT POWER USAGE!!! Sorry, I know you know this, but most don't and it's been quite frustrating.
For those who don't know, TDP is an arbitrary design spec for OEMs to use with the CPU...
AMD's TDP is so much higher than Intel's relative to actual power usage because AMD is much more cautious in it's design spec, not because it uses that much power.
As to Questar's comments, IMHO the fact that the worst thing he can say is a short unsubstantiated rant speaks volumes to the credibility of the article.
Thanks again Johan!
"we can be pretty sure that there are applications out there that do benefit from very fast cache-to-cache transfers"
How can you be pretty sure when you've cited none? I know you said you'll do more testing - but *after* that testing is done seems like the time to be "pretty sure" it's a real world benefit.
You've written a good article, it was informative. Just prefer conservative research conclusions.
Intel is by no means panicking, they're riding out a storm, and things will be dicey starting about 2/3 through 2006. AMD has the advantage now, but I honestly don't know if they can hold up against the R&D budget Intel has at it's fingertips.
When P-m features get integrated into Intel's lineups AMD will be faced with the hotter, hungrier chip, and though they have more experience with the on-die Memory controller, and a nice head of steam, that might not be enough.
I'm a fan of AMD and I applaud their foresight, but they need to keep on the ball if they expect to stay ahead for another year.
I haven't finished the article yet, but would you care to clarify your objections Questar?
At least through the third page I haven't come across any assumptions or even real solid opinions he's put forth as yet.
Thus far it's merely a technically oriented analysis of their respective offerings, nothing that I've read is particularly new or debateable/controversial.
28 Comments
Viditor - Friday, May 20, 2005 - link
fitten - Thanks very much for the explanation!

fitten - Friday, May 20, 2005 - link
"When a thread is blocked it got swapped out of the processor all together. It is the OS's job to check if some conditions are met to re-waken a thread. So a waiting thread will not be actively checking that data at any time.Only in single-write/multi-read situation (server/consumer model) those consumer threads are not blocked but actively checking for new data."
Only if you are using synchronization primitives (mutex, critical section, semaphore, etc.) which are kernel objects or you call sleep() or something in the midst of reading/writing values. If you are just reading/writing a memory location, the OS doesn't know anything about it. Plus, if you have multiple CPUs/cores, more than one thread can be running simultaneously, which is where the MOESI protocols really come into play.
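To make that distinction concrete, here is a minimal C++ sketch (modern std::thread syntax, purely illustrative; the names are invented): the first waiter blocks on a kernel-backed condition variable and is descheduled until notified, while the second just spins on a shared memory location the OS never sees - only the cache-coherence hardware does.

```cpp
#include <atomic>
#include <condition_variable>
#include <mutex>
#include <thread>

std::mutex m;
std::condition_variable cv;
bool ready = false;                 // protected by m
std::atomic<bool> flag{false};      // plain shared memory, no kernel object involved

// Blocking consumer: the OS deschedules this thread until notify_one() is called.
void blocking_waiter() {
    std::unique_lock<std::mutex> lock(m);
    cv.wait(lock, [] { return ready; });   // sleeps in the kernel, uses no CPU while waiting
    // ... consume data ...
}

// Polling consumer: never blocks; it keeps re-reading the flag, so only the
// cache-coherence protocol (MESI/MOESI) is involved, not the scheduler.
void spinning_waiter() {
    while (!flag.load(std::memory_order_acquire)) {
        // busy-wait: burns a core while checking for new data
    }
    // ... consume data ...
}

int main() {
    std::thread t1(blocking_waiter), t2(spinning_waiter);
    { std::lock_guard<std::mutex> lock(m); ready = true; }
    cv.notify_one();                                   // wakes t1 via the OS
    flag.store(true, std::memory_order_release);       // t2 sees it via the caches
    t1.join(); t2.join();
}
```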
cz - Friday, May 20, 2005 - link
When a thread is blocked, it gets swapped out of the processor altogether. It is the OS's job to check whether the conditions to re-awaken the thread are met, so a waiting thread will not be actively checking that data at any time. Only in a single-write/multi-read situation (server/consumer model) are the consumer threads not blocked but actively checking for new data.
fitten - Thursday, May 19, 2005 - link
"When you write a program where the threads are effectively fighting over the ownership of data, particularly in the current designs of multiprocessor (this includes multi-core) cache systems, performance will tank because of all the overhead of taking ownership and such"But doesn't AMDs MOESI protocol help avoid this by allowing one cache to copy data from another?"
No, MOESI doesn't help avoid the problem - it is the mechanism by which the problem is arbitrated and resolved.
Simplified example: CPU1 wants some data. The cache subsystem uses MOESI to determine that CPU0 currently owns that data. MOESI protocols are then used to transfer the ownership of that data to CPU1 (including copying the data to a different cache if necessary). Meanwhile, one (definitely the writing core) or both cores must wait while the MOESI stuff is done and then CPU1 is allowed to proceed with its write.
So, you can write a two thread program where each thread does nothing but writes a value into a memory location (both threads write to the same memory location). That cannot be avoided by anything. On every write, MOESI will be invoked to resolve the ownership of the data and make sure the processor currently wanting to write to that memory location owns it. So, these two threads will generate massive amounts of MOESI traffic between the two caches (on a multi-core or multi-processor machine) because both cores want to effectively always own that memory. While MOESI is fast, it still takes time to resolve, longer than not having to do the transfer of ownership and any copying required in any case. So, you have two cores fighting over the data and generating a lot of MOESI overhead which saps performance from both cores (both cores spend a bit of time waiting until the cache tells it that it can do its writing).
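As a rough sketch of that two-thread worst case (illustrative modern C++; the iteration count, padding constant and names are my own choices): both threads store to one shared location, so every write triggers an ownership transfer between the caches, while giving each thread its own padded cache line makes the same number of writes uncontended.

```cpp
#include <atomic>
#include <thread>

constexpr int kIters = 10'000'000;

// Worst case: both threads store to the same location, so ownership of the
// cache line ping-pongs between the two cores on every write (MOESI traffic).
std::atomic<int> shared_value{0};

// Better: each thread writes its own value, padded out to a full 64-byte
// cache line so the two threads never share a line at all.
struct alignas(64) Padded { std::atomic<int> value{0}; };
Padded per_thread[2];

void fight_over_one_line(int v) { for (int i = 0; i < kIters; ++i) shared_value.store(v); }
void write_own_line(int idx)    { for (int i = 0; i < kIters; ++i) per_thread[idx].value.store(i); }

int main() {
    // Contended case: expect this to be far slower on a multi-core machine.
    std::thread a(fight_over_one_line, 1), b(fight_over_one_line, 2);
    a.join(); b.join();

    // Uncontended case: same number of stores, no ownership transfers.
    std::thread c(write_own_line, 0), d(write_own_line, 1);
    c.join(); d.join();
}
```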
"I agree fully that most multi threaded applications are coarse grained. But there are HPC applications where you can not avoid to work on shared data. I believe fluid dynamics, and OLTP applications that mix writes with reads (and use row locking) are examples."
Absolutely. There are times when it simply cannot be avoided and must be done. But, if you can avoid it, then you probably want to avoid it :)
JohanAnandtech - Thursday, May 19, 2005 - link
Ahkorishaan:

Good summary, that is most likely what is happening at Intel.
bob661:
"The Quest for More Processing Power, Part Three: ", that doesn't sound like a buyers guide hey? :-)
nserra:
Very astute! Ok, ok: "AMD's current dual core architecture is pretty good, let's wait until Intel gets it right." :-)
Fitten:
I agree fully that most multi-threaded applications are coarse-grained. But there are HPC applications where you cannot avoid working on shared data. I believe fluid dynamics, and OLTP applications that mix writes with reads (and use row locking), are examples.
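As a toy illustration of the row-locking idea (an in-memory sketch, not how any real OLTP engine implements it): each "row" carries its own lock, so only one thread can update a given account at a time, and two cores hammering the same row end up passing the lock's cache line back and forth.

```cpp
#include <mutex>
#include <string>
#include <thread>
#include <unordered_map>

// Toy "row": each account carries its own lock, mimicking row-level locking.
struct AccountRow {
    std::mutex lock;
    long long balance = 0;
};

std::unordered_map<std::string, AccountRow> accounts;

void deposit(const std::string& id, long long amount) {
    AccountRow& row = accounts.at(id);              // row is assumed to exist already
    std::lock_guard<std::mutex> guard(row.lock);    // only one writer per row at a time
    row.balance += amount;                          // the update itself is serialized
}

int main() {
    accounts["alice"];                              // create the row up front
    // Two threads updating the same row: correct, but they contend on one lock
    // (and one cache line), which is where cache-to-cache transfer speed matters.
    std::thread t1([] { for (int i = 0; i < 100000; ++i) deposit("alice", 1); });
    std::thread t2([] { for (int i = 0; i < 100000; ++i) deposit("alice", 1); });
    t1.join(); t2.join();
    return accounts["alice"].balance == 200000 ? 0 : 1;
}
```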
Viditor - Thursday, May 19, 2005 - link
"When you write a program where the threads are effectively fighting over the ownership of data, particularly in the current designs of multiprocessor (this includes multi-core) cache systems, performance will tank because of all the overhead of taking ownership and such"But doesn't AMDs MOESI protocol help avoid this by allowing one cache to copy data from another?
fitten - Thursday, May 19, 2005 - link
Processes that will benefit from fast cache-to-cache transfers are ones that are multithreaded and whose threads are manipulating the same data. There are applications that do this, but usually when you design multi-threaded applications you try to avoid these types of situations. When you write a program where the threads are effectively fighting over the ownership of data, particularly in the current designs of multiprocessor (this includes multi-core) cache systems, performance will tank because of all the overhead of taking ownership and such. Shared (L2) caches tend to help this out because the data doesn't actually have to be transferred to the other core's cache as part of the taking of ownership; the cache line(s) can stay right where they are with only the ownership modified.

Anyway, HPC code usually goes through pains to avoid the situation where ownership of data must switch between processes/threads often. That's why data partitioning is one of the most important steps of application design in parallel applications.
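A minimal sketch of that kind of partitioning, assuming a simple contiguous split (C++ threads, names invented for illustration): each thread works on its own slice and writes its result to its own padded slot, so ownership of the data never has to move between cores while the threads run.

```cpp
#include <cstddef>
#include <thread>
#include <vector>

// Each thread sums its own contiguous slice and writes to its own padded slot,
// so no cache line has to change owners during the parallel phase.
struct alignas(64) PartialSum { long long value = 0; };

long long parallel_sum(const std::vector<int>& data, unsigned num_threads) {
    std::vector<PartialSum> partial(num_threads);
    std::vector<std::thread> workers;
    const std::size_t chunk = data.size() / num_threads;

    for (unsigned t = 0; t < num_threads; ++t) {
        std::size_t begin = t * chunk;
        std::size_t end = (t + 1 == num_threads) ? data.size() : begin + chunk;
        workers.emplace_back([&, t, begin, end] {
            long long local = 0;                    // accumulate in a local variable
            for (std::size_t i = begin; i < end; ++i) local += data[i];
            partial[t].value = local;               // one write to the padded slot
        });
    }
    for (auto& w : workers) w.join();

    long long total = 0;
    for (const auto& p : partial) total += p.value; // cheap serial combine step
    return total;
}

int main() {
    std::vector<int> data(1'000'000, 1);
    return parallel_sum(data, 4) == 1'000'000 ? 0 : 1;
}
```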
blackbrrd - Thursday, May 19, 2005 - link
Uhm.. #19 - that is exactly the point, to check if a row is locked you most likely have to query the other caches to see if it is locked or not...

JNo - Thursday, May 19, 2005 - link
"In Part 2, Tim Sweeney, the leading developer behind the Unreal 3 engine, explained the challenges of multi-threaded development of the next generation of games."...before showing off a beautiful working demo of the Unreal 3 engine on the 7-core PS3 cell processor that was put together in only 2 months and that was relatively easy to develop according to the Unreal guys themselves... Ha! (cos Sweeney did downplay the use of multithreading in games if you read his original comments)
cz - Thursday, May 19, 2005 - link
It is an interesting read, I would say. But I would like to point out that OLTP programs will not benefit from cache2cache performance very much. That is because the very principle of multi-threaded programming requires the user account to be locked before updating. So only one thread can update a user account at any given time and other threads are blocked. Only programs that use data in single-write and multi-read form will benefit from cache2cache performance. And most likely these applications will be some sort of scientific simulations.

nserra - Thursday, May 19, 2005 - link
The previous post was for the biased person who wrote this article. Johan De Gelas^
|Just kidding ;)
nserra - Thursday, May 19, 2005 - link
"AMDs current dual core architecture is vastly superior to Intels"This is wrong!!! You said your self that Intel "new" processor was more of a “special” packing than a dual core processor, so you should say is:
"AMDs current dual core architecture is amazing let’s wait what intel will do at a latter time"
TDP relates to power consumption the way a 500W PSU rating does: just because you have a 500W PSU doesn't mean it draws 500W of power.
The new Venice core has more transistors than the previous core, and not just because of SSE3; there are new power stages that can be enabled to further lower power consumption. I doubt that putting a Turion on a regular board will enable those new power stages.
Viditor - Thursday, May 19, 2005 - link
G'day Jarred!"the Pentium M 2.0 GHz chips manage to run at 22W"
To be specific, they have a TDP of 22 watts which isn't really the same thing...
"as I understand it even under maximum load the Pentium M stays under 22W, right?"
Not at all...in fact it can be significantly higher than that. Intel's TDP measures an average usage under load rather than peak, while AMD's measures absolute theoretical peak under the worst conditions. This is why the TDP is quite meaningless...
I guess my point is that I am of the opinion that the Turion might actually run at significantly lower power usages. As absolutely nobody (that I am aware of) has tested beyond the system level (i.e. the chip itself), I can't be sure...but judging by the actual specs of the chips themselves (not the TDP, but the electrical specifications) it appears that the PM may indeed be higher.
I know I've asked before, but with the power usage and heat becoming more and more important, couldn't you guys develop a test of the actual realworld usage of the chips themselves?
I think it might be quite illuminating...
Cheers!
4lpha0ne - Thursday, May 19, 2005 - link
@Questar:

Criticizing Intel and saying good things about AMD and IBM means that Johan is an AMD fanboy? I think not. You'll see that the opinions about Whitefield, Merom, Yonah & Co. after they hit the public will be better than the opinions about Smithfield now. That's simply the result of the amount of effort put into the designs. A dedicated dual core design is not the same as an on-die dual Xeon system.
@photoguy99:
I'd say Johan can make this conclusion because he has the knowledge to do so. I'd come to the same conclusion, since the Windows scheduler (at least for XP) is not very core-aware. It just sees the logical or physical CPUs, and if one becomes free, it just sends the next thread to it. This causes thread-hopping (which can be seen in the Tech Report dual core reviews thanks to task manager screenshots). In such cases it matters somewhat whether the last used data is in the other L2 cache and can be quickly transferred to the current L2 cache. And it matters for multithreaded applications which work on the same set of data.
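One way to take the scheduler out of the picture is to pin the thread yourself; here is a minimal Win32 sketch (illustrative only, error handling kept to a bare minimum) that locks the calling thread to a single logical CPU so its working set stays in one L2 cache instead of hopping between cores.

```cpp
#include <windows.h>

int main() {
    // Restrict the calling thread to logical CPU 0. The scheduler can then no
    // longer bounce it to another core, so its working set stays in one L2.
    DWORD_PTR previous = SetThreadAffinityMask(GetCurrentThread(), 1);
    if (previous == 0) {
        return 1;   // call failed
    }

    // ... run the cache-sensitive work here ...

    // Optionally restore the original affinity afterwards.
    SetThreadAffinityMask(GetCurrentThread(), previous);
    return 0;
}
```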
@mazuz:
I'd suggest looking at benchmarks of a 275 vs. dual 248s with one dual-channel memory bank, and benchmarks of a dual Xeon with FSB800 vs. a similarly configured (cache, FSB, memory, HT) Smithfield. That's the difference caused by the SRQ connection.
@Ahkorishaan:
The mentioned upcoming Intel cores will indeed be nice. But some people here and on many other forums sound like the dual core K8 was AMD's last CPU and the K8 their last core ever. :) However, have a look at AMD's patent portfolio and you'll see that this is not the case. As Fred Weber said, AMD is also still looking at power consumption. This is maybe the reason why we might see a future CPU with more cores, but less FPU power per core (due to shared FPUs).
AMD is also working on using things like clock gating and throttling (used by the P-M) to further reduce power consumption. Currently they have only implemented some standard features to keep power consumption down, like different transistor designs (especially slower transistors in less critical places), microarchitectural changes (a better HALT mode), the C3 state, and PowerNow!/C'n'Q.
Matthias
JarredWalton - Wednesday, May 18, 2005 - link
Viditor, I think the point is that the Pentium M 2.0 GHz chips manage to run at 22W - still less than 1/3 of what the Winchester and Venice cores put out, I think. What exactly did they do to get that low? Well, there's gating technology for sure - i.e. power down unused portions of the chip - but as I understand it even under maximum load the Pentium M stays under 22W, right?

Maybe Johan has more specifics, but I don't. I just know the price for power use on the design is very impressive, and I was surprised some of the same tech wasn't used in Prescott.
Viditor - Wednesday, May 18, 2005 - link
Your usual excellent work Johan, thanks.

A couple of nits to pick...
"Intel will use its P-m “know-how” to keep the power dissipation so low"
If you could qualify exactly what "know-how" you mean, that would be appreciated. IMHO, a major reason that the PM is able to stay so much cooler than the Netburst chips (and on par with the Athlons) is that it doesn't have nearly as many features... Is there a reason you see the PM translating well into full-blown server and desktop chips?
"Intel can leverage their experience with the power saving features of the P-m to design quad core CPUs with remarkably low TDP"
Arrrrrggghhh! This is a pet peeve of mine. TDP IS NOT POWER USAGE!!! Sorry, I know you know this, but most don't and it's been quite frustrating.
For those who don't know, TDP is an arbitrary design spec for OEMs to use with the CPU...
AMD's TDP is so much higher than Intel's relative to actual power usage because AMD is much more cautious in its design spec, not because it uses that much power.
As to Questar's comments, IMHO the fact that the worst thing he can say is a short unsubstantiated rant speaks volumes to the credibility of the article.
Thanks again Johan!
phaxmohdem - Wednesday, May 18, 2005 - link
You are all fools. IDT's Winchip X2 dual core solution will blow all of this crapolla out of the water.

mazuz - Wednesday, May 18, 2005 - link
"AMDs current dual core architecture is vastly superior to Intels"This seems like a pretty strong statement considering there doesn't seem to be any known real world advantage to this architecture.
photoguy99 - Wednesday, May 18, 2005 - link
Johan, isn't this statement a little unfounded:

"we can be pretty sure that there are applications out there that do benefit from very fast cache-to-cache transfers"
How can you be pretty sure when you've cited none? I know you said you'll do more testing - but *after* that testing is done seems like the time to be "pretty sure" it's a real world benefit.
You've written a good article, it was informative. I just prefer conservative research conclusions.
bob661 - Wednesday, May 18, 2005 - link
#7

Who cares which company is ahead or behind? I sure as hell don't. Give me good bang for the buck. That's all I want.
Houdani - Wednesday, May 18, 2005 - link
'Splain to me what you believe are the alleged "false assumptions."

The only outright assumption I observed was located in the comments section. Specifically number two.
Ahkorishaan - Wednesday, May 18, 2005 - link
Intel is by no means panicking, they're riding out a storm, and things will be dicey starting about 2/3 through 2006. AMD has the advantage now, but I honestly don't know if they can hold up against the R&D budget Intel has at its fingertips.

When P-M features get integrated into Intel's lineups AMD will be faced with the hotter, hungrier chip, and though they have more experience with the on-die memory controller, and a nice head of steam, that might not be enough.
I'm a fan of AMD and I applaud their foresight, but they need to keep on the ball if they expect to stay ahead for another year.
allanw - Wednesday, May 18, 2005 - link
All this talk of databases and no mention of PostgreSQL? C'mon..

flatblastard - Wednesday, May 18, 2005 - link
Oh great....more fuel for the "Intel panics" thread fire.

Rand - Wednesday, May 18, 2005 - link
I haven't finished the article yet, but would you care to clarify your objections Questar?

At least through the third page I haven't come across any assumptions or even real solid opinions he's put forth as yet.
Thus far it's merely a technically oriented analysis of their respective offerings; nothing that I've read is particularly new or debatable/controversial.
Rapsven - Wednesday, May 18, 2005 - link
Holy ****, Questar. That's all I'm going to say for you.

Very informative. Though a lot of the more technical parts of the article flew right by me.
Questar - Wednesday, May 18, 2005 - link
Wow, another AMD fanboy opinion piece based upon false assumptions. Go Anandtech!

sprockkets - Wednesday, May 18, 2005 - link
not this time...nice pic on the last page, but I have no idea of the scale