AMD’s Manju Hegde is one of the rare folks I get to interact with who has an extensive background working at both AMD and NVIDIA. He was one of the co-founders and CEO of Ageia, a company that originally tried to bring higher quality physics simulation to desktop PCs in the mid-2000s. In 2008, NVIDIA acquired Ageia and Manju went along, becoming NVIDIA’s VP of CUDA Technical Marketing. The CUDA fit was a natural one for Manju, as he had spent the previous three years working on non-graphics workloads for highly parallel processors. Two years later, Manju made his way to AMD to continue his vision for heterogeneous compute work on GPUs. He is currently the Corporate VP of Heterogeneous Applications and Developer Solutions at AMD.

Given what we know about the new AMD and its goal of building a Heterogeneous Systems Architecture (HSA), Manju’s position is quite important. For those of you who don’t remember back to AMD’s 2012 Financial Analyst Day, the formalized AMD strategy is to exploit its GPU advantages on the APU front in as many markets as possible. AMD has a significant GPU performance advantage compared to Intel, but in order to capitalize on that it needs developer support for heterogeneous compute. A major struggle everyone in the GPGPU space faced was enabling applications that took advantage of the incredible horsepower these processors offered. With AMD’s strategy closely married to doing more (but not all, hence the heterogeneous prefix) compute on the GPU, it needs to succeed where others have failed.

The hardware strategy is clear: don’t just build discrete CPUs and GPUs, but instead transition to APUs. This is nothing new, as both AMD and Intel have been headed in this direction for years. Where AMD sets itself apart is that it is willing to dedicate more transistors to the GPU than Intel. The CPU and GPU are treated almost as equal-class citizens on AMD APUs, at least when it comes to die area.

The software strategy is what AMD is working on now. AMD’s Fusion12 Developer Summit (AFDS), in its second year, is where developers can go to learn more about AMD’s heterogeneous compute platform and strategy. Why would a developer attend? AMD argues that the speedups offered by heterogeneous compute can be substantial enough that they could enable new features, usage models or experiences that wouldn’t otherwise be possible. In other words, taking advantage of heterogeneous compute can enable differentiation for a developer.

That brings us to today. In advance of this year’s AFDS, Manju has agreed to directly answer your questions about heterogeneous compute, where the industry is headed and anything else AMD will be covering at AFDS. Manju has a BS in Electrical Engineering (IIT, Bombay) and a PhD in Computer Information and Control Engineering (UMich, Ann Arbor) so make the questions as tough as you can. He'll be answering them on May 21st so keep the submissions coming.

101 Comments

  • SleepyFE - Wednesday, May 16, 2012 - link

    THE TROLL strikes again
  • gcor - Wednesday, May 16, 2012 - link

    I ask because I used to work on a telecoms platform that used PPC chips, with vector processors that *I think* are quite analogous to GPGPU programming. We offloaded as much as possible to the vector processors (e.g. huge quantities of realtime audio processing). Unfortunately it was extremely difficult to write reliable code for the vector processors. The software engineering costs wound up being so high that after 4-5 years of struggling, the company decided to ditch the vector processing entirely and put in more general compute hardware power instead. This was on a project with slightly less than 5,000 software engineers, so there were a lot of bodies available. The problem wasn't so much the number of people as the number of very high calibre people required. In fact, having migrated back to generalised code, the build system took out the compiler support for the vector processing to ensure that it could never be used again. Those vector processors now sit idle in telecoms nodes all over the world.

    Also, wasn't the lack of developer take-up of vector processing one of the reasons why Apple gave up on PPC and moved to Intel? Apple initially touted that they had massively more compute available than Windows Intel based machines. However, in the long run no, or almost no, applications used the vector processing compute power available, leaving the PPC platform with no real advantage.

    Anyway, I hope the problem isn't intrinsically too hard for mainstream adoption. It'll be interesting to see how x264 development gets through its present quality issues with OpenCL.
  • BenchPress - Wednesday, May 16, 2012 - link

    Any chance this is IBM's Cell processor you're talking about? Been there, done that. It's indeed largely analogous to GPGPU programming.

    To be fair, though, HSA will have advantages over Cell, such as a unified coherent memory space. But that won't be enough to eliminate the increase in engineering cost. You still have to account for latency issues, bandwidth bottlenecks, register limits, call stack size, synchronization overhead, etc.

    AVX2 doesn't have these drawbacks, and the compiler can do auto-vectorization very effectively thanks to finally having a complete 'vertical' SIMD instruction set. So you don't need "high calibre people" to ensure that you'll get good speedups out of SPMD processing.
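    To illustrate the kind of code I mean: a loop like the one below is the classic SPMD shape (the same independent operation per element), and a compiler targeting AVX2 can vectorize it without any source changes. A minimal sketch; the flags and the instruction it maps to are what I'd expect from gcc, not something I've benchmarked here.

```c
#include <stddef.h>

/* SPMD-style loop: the same independent operation applied to every
   element. With a complete "vertical" SIMD set (AVX2 + FMA), the
   compiler can turn this into 8-wide fused multiply-adds with no
   source changes, e.g. `gcc -O3 -mavx2 -mfma`. */
void saxpy(float a, const float *x, float *y, size_t n) {
    for (size_t i = 0; i < n; i++)
        y[i] = a * x[i] + y[i];  /* vectorizes to one FMA per 8 floats */
}
```

    No intrinsics, no kernel launch, no copying data to a separate memory space: that's the engineering-cost argument in a nutshell.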
  • _vor_ - Wednesday, May 16, 2012 - link

    Enough with the AVX2 Nerdrage. Seriously.
  • BenchPress - Wednesday, May 16, 2012 - link

    What is your problem with AVX2?

    If, hypothetically, some technology were superior to GPGPU, wouldn't you want to know about it so you could stop wasting your time with GPGPU? What if that technology is AVX2?

    I'm completely open to the opinion that it's not, but I haven't seen technical arguments to the contrary yet. So please be open to the possibility that GPGPU won't ever deliver on its promise and will be surpassed by homogeneous high-throughput computing technology.
  • _vor_ - Wednesday, May 16, 2012 - link

    lol. Ok seriously. Are they paying you per post?
  • BenchPress - Wednesday, May 16, 2012 - link

    No, nobody's paying me to post here.

    Please read gcor's post again. He raised very serious real world concerns about heterogeneous computing. So I'm just trying to help him and everyone else by indicating that with AVX2 we'll get the performance boost of SPMD processing without the disadvantages of a heterogeneous architecture.

    Is it so hard to believe that someone might be willing to help other people without getting paid for it? I don't see why you have a problem with that.
  • SleepyFE - Friday, May 18, 2012 - link

    How would AVX2 handle graphics processing?
  • BenchPress - Friday, May 18, 2012 - link

    I am only suggesting using AVX2 for general purpose high throughput computing. Graphics can still be done on a highly dedicated IGP or discrete GPU.

    This is the course NVIDIA is taking. With GK104, they spent less die space and power on features that would only benefit GPGPU. They realize the consumer market has no need for heterogeneous computing since the overhead is high, it's hard to develop for, it sacrifices graphics performance, and last but not least the CPU will be getting high throughput technology with AVX2.

    So let the CPU concentrate on anything general purpose, and let the GPU concentrate on graphics. The dividing line depends on whether you intend to read anything back. Not reading results back, like with graphics, allows the GPU to tolerate higher latencies and increase its computing density. General purpose computing demands low latencies, and this is the CPU's strong point. AVX2 offers four times the floating-point throughput of SSE4, so that's no longer a reason to attempt to use the GPU.
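    The 4x figure falls out of two doublings: twice the register width and twice the flops per instruction. A back-of-envelope sketch (peak numbers only, not a benchmark):

```c
/* Where the "4x over SSE4" figure comes from: AVX2 doubles the
   register width (256-bit vs 128-bit) and FMA doubles the flops
   per instruction (a fused multiply-add vs a separate mul or add). */
int avx2_speedup_over_sse4(void) {
    int sse4_lanes = 128 / 32;  /* 4 single-precision lanes */
    int avx2_lanes = 256 / 32;  /* 8 single-precision lanes */
    int fma_flops  = 2;         /* mul + add fused in one instruction */
    return (avx2_lanes * fma_flops) / (sse4_lanes * 1);
}
```

    Peak throughput, of course; real code sees less, but the same caveat applies to GPU peak numbers.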
  • SleepyFE - Saturday, May 19, 2012 - link

    So you want a GPU and an extra-large CPU (to accommodate AVX)?
    That is what they do NOT want. In case you didn't notice, they are trying to make better use of the resources available. With Bulldozer modules they cut the FPUs in half. The GPU will handle showing the desktop and doing some serious computing (what idiot would play a game while waiting?). That is the point. Smaller dies doing more (possibly at all times). Efficiency.
