Answered by the Experts: Heterogeneous and GPU Compute with AMD’s Manju Hegde

Author: Anand Lal Shimpi

by Anand Lal Shimpi on May 21, 2012 12:58 PM EST


Latency and overhead by GullLars

Will the GPGPU acceleration mainly improve embarrassingly parallel and compute-bandwidth-constrained applications, or will it also be able to accelerate smaller pieces of work that are parallel to a significant degree?

Hitherto, only workloads with a significant amount of data-parallel components could benefit from heterogeneous compute. However, with HSA APUs the communication between GPU and CPU is no longer subject to unnecessary copies, no cache flushes are automatically invoked, and the optimization of the runtime and driver stacks greatly reduces the dispatch latency. As a result, the type and number of workloads that benefit from heterogeneous compute are greatly increased.
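The effect of copy and dispatch overhead on which workloads are worth offloading can be sketched with a simple cost model. This is only an illustration: the function names are hypothetical and the figures (in nanoseconds) are made-up assumptions, not measurements of any AMD product.

```python
def cpu_time(n, cpu_per_elem):
    """Time to process n elements on the CPU."""
    return n * cpu_per_elem


def offload_time(n, gpu_per_elem, dispatch_latency, copy_per_elem=0):
    """Time to process n elements on the GPU, including a fixed dispatch
    latency and an optional per-element cost for copying data."""
    return dispatch_latency + n * (copy_per_elem + gpu_per_elem)


def break_even_elements(cpu_per_elem, gpu_per_elem, dispatch_latency,
                        copy_per_elem=0):
    """Smallest n for which offloading is strictly faster than the CPU,
    or None if offloading never wins."""
    saving_per_elem = cpu_per_elem - (gpu_per_elem + copy_per_elem)
    if saving_per_elem <= 0:
        return None
    # Offloading wins once n * saving_per_elem > dispatch_latency.
    return int(dispatch_latency // saving_per_elem) + 1


# Assumed figures: CPU 10 ns/element, GPU 1 ns/element.
# With copies (4 ns/element) and a 50,000 ns dispatch latency:
with_copies = break_even_elements(10, 1, 50_000, copy_per_elem=4)   # 10001
# Without copies and with a 5,000 ns dispatch latency:
without_copies = break_even_elements(10, 1, 5_000)                  # 556
```

Under these assumed numbers, removing the copies and shrinking the dispatch latency lowers the break-even point from roughly ten thousand elements to a few hundred, which is the sense in which smaller pieces of parallel work become candidates for offload.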

And what is the latency associated with branching off and running a piece of code on the parallel part of the APU? (e.g. as a method called by a program to work on a large set of independent data in parallel)

It differs from product to product.

Change starts with you by Tanclearas

Although I do agree that there are many opportunities for HSA, I am concerned that AMD's own efforts in using heterogeneous computing have been half-baked. The AMD Video Converter has a smattering of conversion profiles, lacks any user-customizable options (besides a generic "quality" slider), and hasn't seen any update to the profiles in a ridiculously long time (unless there were changes/additions within the last few months).

AMD recognizes that heterogeneous compute requires specific and new measures to ease developer adoption. To this end AMD is adopting the strategy of delivering domain-specific SDKs and providing optimized sample applications. These serve as reference code to ease the developer's job of extracting performance, especially for targeted and common use cases. The APP SDK is an example; stay tuned for more.

It is no secret that Intel has put considerable effort into compiler optimizations that required very little effort on the part of developers to take advantage of. AMD's approach to heterogeneous computing appears simply to wait for developers to do all the heavy lifting.

The question therefore is, when is AMD going to show real initiative with development, and truly enable developers to easily take advantage of HSA? If this is already happening, please provide concrete examples of such. (Note that a 3-day conference that also invites investors is hardly a long-term, on-going commitment to improvement in this area.)

Just to clarify, HSA is not available today. We outlined our roadmap for the future of APUs last year at AFDS, which included the evolution of HSA. Most of the HSA features will be available on our 2013 and 2014 platforms. We are going to announce the schedule for availability of our HSA software stack, our tools and the library plan at AFDS. AFDS is a continued forum where we will bring together software developers to interact with us and our partners to let them know the direction of our platforms in the future. The fact that investors attend does not detract from the fact that it is targeted primarily at software developers. The overwhelming majority of presentations and talks are directed at software developers. Several key partners will be delivering keynotes at AFDS expressing their aligned view of heterogeneous computing, including technical leaders from Adobe, Cloudera, Penguin Computing, Gaikai and SRS.

We have just announced the growing range of software vendors that support OpenCL on our platforms today. These include companies such as SONY, Adobe, Arcsoft, Winzip, Cyberlink, Corel, Roxio, and many, many others. We are confident all of them will be enthusiastic about supporting HSA.

In addition, see the answer to the above question and what we are doing with regard to making OpenCL easier to use.

Two questions by markstock

Mr. Hegde, I have two questions which I hope you will answer.

To your knowledge, what are the major impediments preventing developers from thinking about this new hierarchy of computation and beginning to program for heterogeneous architectures?

See my answer to the first question where I list the hardware features of HSA and the issues they solve. Those are all issues with today's heterogeneous compute models.

AMD clearly aims to fill a void for silicon with tightly-coupled CPU-like and GPU-like computational elements, but are they only targeting the consumer market, or will future hardware be designed to also appeal to HPC users?

Absolutely. We will be bringing HSA-based APUs to the market in the near future, and all the aspects of ease of programming and much greater performance per joule that HSA brings to the market will greatly benefit the HPC space. In fact, Penguin Computing is already implementing APUs in HPC server designs and will be sharing details on HPC heterogeneous compute at AFDS during their keynote.

When will the software catch up? by Loki726

AMD Fellow Mike Mantor has a nice statement that I believe captures the core difference between GPU and CPU design.

"CPUs are fast because they include hardware that automatically discovers and exploits parallelism (ILP) in sequential programs, and this works well as long as the degree of parallelism is modest. When you start replicating cores to exploit highly parallel programs, this hardware becomes redundant and inefficient; it burns power and area rediscovering parallelism that the programmer explicitly exposed. GPUs are fast because they spend the least possible area and energy on executing instructions, and run thousands of instructions in parallel."

Notice that nothing in here prevents a high degree of interoperability between GPU and CPU cores.

When will we see software stacks catch up with heterogeneous hardware? When can we target GPU cores with standard languages (C/C++/Objective-C/Java), compilers (LLVM, GCC, MSVS), and operating systems (Linux/Windows)? The fact that ATI picked a different ISA for their GPUs than x86 is not an excuse; take a page out of ARM's book and start porting compiler backends.

AMD is addressing this via HSA, which introduces an intermediate layer (HSAIL) that insulates software stacks from the individual ISAs. This is a fundamental enabler of the convergence of SW stacks on top of heterogeneous compute (HC).

Unless the install base is large enough, the investment to port *all* standard languages to an ISA is prohibitively large. Individual companies like AMD are motivated but can only target a few languages at a time, and the software community is not motivated if the install base is fragmented. HSA breaks this deadlock by providing a "virtual ISA" in the form of HSAIL that unifies the view of HW platforms for SW developers. It is important to note that this is not just about functionality: HSAIL also preserves performance sufficiently to make the SW stack truly portable across HSA platforms.
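The porting argument can be made concrete with a little arithmetic. The sketch below counts compiler backends under two arrangements; the function names and the counts of languages and ISAs are hypothetical and chosen only for illustration.

```python
def direct_ports(num_languages, num_isas):
    """Every language toolchain targets every hardware ISA directly."""
    return num_languages * num_isas


def ports_via_virtual_isa(num_languages, num_isas):
    """Every language targets the virtual ISA once, and every hardware
    vendor supplies one translator from the virtual ISA to its own ISA."""
    return num_languages + num_isas


# Assumed example: 6 languages and 4 distinct hardware ISAs.
direct = direct_ports(6, 4)                 # 24 backends
layered = ports_via_virtual_isa(6, 4)       # 10 components
```

With an intermediate layer the work grows as the sum of languages and ISAs and no longer as their product, and each language port immediately reaches every platform that implements the layer.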

Why do we need new languages for programming GPUs that inherit the limitations of graphics shading languages? Why not toss OpenCL and DirectX compute, compile C/C++ programs, and launch kernels with a library call? You are crippling high level languages like C++ AMP, Python, and Matlab (not to mention applications) with a laundry list of pointless limitations.

AMD sees OpenCL as a critical and necessary step in the evolution of programming. Single-core programming evolved from assembly to C++ and Java: it started with very few expert programmers doing close-to-metal coding, moved to a larger number of trained professionals driving products, and finally became easy enough for the minimally trained programming masses to target CPUs. Symmetric multi-core programming followed a similar trend, from pthreads to models like OpenMP and TBB.
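The shift from explicit threading to higher-level models can be illustrated by writing the same data-parallel loop in both styles. This is a Python analogy only, using the standard library in place of pthreads, OpenMP or TBB; the function names are hypothetical, and CPython threads are used here to show the shape of the code, not to obtain a speedup.

```python
import threading
from concurrent.futures import ThreadPoolExecutor


def scale(x):
    """The per-element work; doubling stands in for a real computation."""
    return 2 * x


def scale_with_explicit_threads(data, num_threads=4):
    """Low-level style: the programmer partitions the data, creates the
    threads, and joins them by hand."""
    result = [None] * len(data)

    def worker(start, stop):
        for i in range(start, stop):
            result[i] = scale(data[i])

    chunk = (len(data) + num_threads - 1) // num_threads
    threads = []
    for t in range(num_threads):
        start = t * chunk
        stop = min(start + chunk, len(data))
        thread = threading.Thread(target=worker, args=(start, stop))
        threads.append(thread)
        thread.start()
    for thread in threads:
        thread.join()
    return result


def scale_with_pool(data, num_threads=4):
    """Higher-level style: the runtime handles partitioning and scheduling."""
    with ThreadPoolExecutor(max_workers=num_threads) as pool:
        return list(pool.map(scale, data))
```

Both functions return the same list for the same input; the second one simply leaves the bookkeeping to the runtime, which is the kind of step that made multi-core programming accessible to more developers.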

Today, pioneered by experts who managed to write compute code within shaders, heterogeneous compute has its first standard programming model in OpenCL. AMD introduced Aparapi, which gives Java developers an easy way to access GPU compute. C++ AMP is the first instance of the natural next step in this evolution, i.e. extensions of existing programming models to target GPU compute, which brings in adoption by their (large) communities. AMD will strongly support this expansion into languages like Fortran, Python, Ruby, R, Matlab…

In addition, domain-specific libraries are also being targeted, e.g. OpenCV, x264, crypto++, to allow the programmer to focus on the job at hand, instead of the mechanics of obtaining performance. This is the fastest way to enable existing application code bases to leverage heterogeneous compute.

And of course, HSA is a key enabler of this next step since it expands the install base for SW developers to target via the portable performance it enables across various ISAs.

However, as with assembly optimizations, AMD expects OpenCL to continue to coexist with high-level programming so that performance-critical developers can extract the most out of a particular platform.

Where's separable compilation? Why do you have multiple address spaces? Where is memory mapped IO? Why not support arbitrary control flow? Why are scratchpads not virtualized? Why can't SW change memory mappings? Why are thread schedulers not fair? Why can't SW interrupt running threads? The industry solved these problems in the 80s. Read about how they did it, you might be surprised that the exact same solutions apply.

- OpenCL 1.2 (supported by the upcoming AMD APP SDK 2.7) supports clCompileProgram and clLinkProgram.
- The HSA MMU enables a shared address space between CPU and GPU.
- HSAIL supports more flexible control flow.
- SI-based GPUs include high-performance read/write caches which effectively can be virtualized.
- Future AMD APUs will support HW context switching, including the ability for SW to interrupt running threads.
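The practical difference a shared address space makes can be sketched by contrasting a copy-in/copy-out pattern with working on the same memory through a view. This is a single-process Python analogy under stated assumptions, not HSA or OpenCL code; both function names are hypothetical.

```python
def process_with_copy(host_buffer):
    """Separate address spaces: the worker gets a private copy, so the
    results have to be copied back explicitly."""
    device_copy = bytearray(host_buffer)          # copy "to the device"
    for i in range(len(device_copy)):
        device_copy[i] = (device_copy[i] + 1) % 256
    host_buffer[:] = device_copy                  # copy "back to the host"


def process_shared(host_buffer):
    """Shared address space: the worker updates the caller's memory in
    place through a view, with no copies in either direction."""
    with memoryview(host_buffer) as view:
        for i in range(len(view)):
            view[i] = (view[i] + 1) % 256
```

Both functions leave the caller's `bytearray` with every byte incremented, but the second never duplicates the data, which is the property a shared address space offers to CPU and GPU code that pass pointers to each other.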


15 Comments


ltcommanderdata - Monday, May 21, 2012 - link
I just wanted to thank Manju Hegde for taking the time to respond to all those questions. Thank you Anand as well for organizing this. I'm definitely looking forward to the rollout and adoption of HSA.
Lucian Armasu - Monday, May 21, 2012 - link
That sounds kind of like what ARM wants to do with big.LITTLE and Mali GPUs:

http://www.cnx-software.com/wp-content/uploads/201...

http://www.cnx-software.com/2011/11/22/midgard-arc...
CeriseCogburn - Wednesday, May 23, 2012 - link
The first page sounds to me like AMD is going for Homeland Security dollars.
" We have done extensive analysis on several workloads and have obtained significant performance per joule savings for workloads such as face detection, image stabilization, gesture recognition etc"

I see.
Don't be evil amd.
CeriseCogburn - Wednesday, May 23, 2012 - link
Then I thought maybe they are going for gaming console capabilities - but it appears on page2 they are going for logging on to Windows:

" OSes are moving towards providing some base functionality in terms of security, voice recognition, face detection, biometrics, gesture recognition, authentication, some core database functionality. All these benefit significantly from the optimizations in HSA described above. With the industry support we are building this should happen in the next few years."

So, it's probably all three.
Loki726 - Monday, May 21, 2012 - link
I'll need some time to read over all of the responses, but I want to thank Manju Hegde for taking the time to respond in such detail.
SanLouBlues - Monday, May 21, 2012 - link
I'm not sure he understood the question of the guy with the Intel CPU and AMD GPU who wanted GPU accelerated winzip.
c0d1f1ed - Monday, May 21, 2012 - link
None of the tough questions were answered, like how they're hoping to compete against AVX2 and AVX-1024.

I'm sure HSA is an improvement, but it's not developed in a vacuum. AMD has to be willing to compare it against homogeneous high throughput computing. They won't gain the respect of developers if they dodge questions. It doesn't show much confidence in their own technology.
texasti89 - Monday, May 21, 2012 - link
+1
B3an - Monday, May 21, 2012 - link
Thank you Manju for answering my questions! :) And an interesting article, still reading it...
oldguybt - Tuesday, May 22, 2012 - link
Sounds like it was a surprise for AMD...? As if they only just found out that they usually have 2x, 3x or 4x more stream processors than Nvidia has CUDA cores.
Probably a local "hacker" told them that when you are trying to crack a WPA/WPA2 internet password in BackTrack, an AMD GPU does it 4 or 5 times faster.
Well, I think I found that out sooner than AMD did :D

OK, sorry for the stupidity....
Anyway, AMD says its GPUs are made for gaming, but they are actually like Tesla, or am I wrong?? :)

Just buy Nvidia.
