Monday, June 20, 2011

What's this blog about?

Hello World!

This blog is just a micron of the Internet where I want to write about what I do, read and think regarding HPC. That means software, libraries, languages, operating systems and runtimes, and of course hardware!!!

It's a nice time to be in HPC (High Performance Computing, for those who don't know). There is an explosion of architectures and derivative software stacks. Imagination and creativity are a big part of the equation when designing and developing computer systems, and all of that is not restricted to HPC.

As some may know, thanks to certain technological limits, the whole computing ecosystem has to go down some dark roads. How much of "what" do I have to put on which sort of processor? Who will program it, how, and for which purposes? How general or specific, homogeneous or heterogeneous, might my future architecture be? How can I give C/C++ access to all that madness? Crazy questions.

Even more so now that a Microsoft engineer has put the cloud into the equation. In fact, the cloud is a heterogeneous distributed system!!

So, say I want to write a game that runs super-fast and super-smooth on next-generation mobile devices. OK, no problem: learn some multicore CPU programming if you want it to really use the hardware, because it will be a dual- or quad-core ARM processor. And if you need more, learn OpenCL and write your super-low-level, super-tuned kernels using the 12 or more GPU cores that sit on the same chip. And we are talking about a mobile device.

Concurrency and heterogeneity. These are the two main trends.

But why? Take a look at the AMD Fusion Developer Summit rebroadcasts and you'll find some pretty good stuff from an ARM guy, talking about the "why oh why".

It's almost all about the so-called Power Wall. It sounds pretty simple: power consumption limits the GHz I can put on a processor. Well, power is not the only reason for the lack of 200 GHz CPUs, but it is one of them. This forces us into multicore CPUs.
The other factor, probably the most important one, is: how much power does a transistor consume? If I make it smaller, how many extra transistors can I put in the same area, and how much less power will they consume? Well, the thing is that we can make transistors smaller much faster than we can make them greener. So with every transistor size reduction we end up with a more power-hungry chip, if we keep using the same die area.
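In rough terms (the classic first-order model, which I'm adding here for context, not something from the Summit talks), dynamic power goes as

P_dyn ≈ α · C · V² · f

where α is the switching activity, C the switched capacitance, V the supply voltage and f the clock frequency. Shrinking used to let V drop along with transistor size (Dennard scaling), keeping the power of a fixed die area roughly constant; now V barely scales, so filling the same area with ever more transistors at the same frequency drives total power up.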
That's why some are arguing for heterogeneous computing as a solution: that way we can use only the transistors needed for a given task and switch off the rest, to avoid "frying-egg" chips. And that's also the reason why Intel is investing in transistor research, trying to make them not only smaller, but much less power-hungry than anyone else's offering.

So here comes the battle. And meanwhile, programmers are eager to use all that heterogeneity in their programs... but what's out there? CUDA, OpenCL, OpenMP, MPI, HMPP, pthreads?? Etc., etc., etc... Well, for some, OpenCL is the only way in the heterogeneous world, since it's an open standard intended for any hardware. But it is based on C... no objects can execute on the GPU... and what about having two different memory spaces?...
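To make that last complaint concrete, here is a minimal sketch (mine, not from any vendor sample) of what "two memory spaces" means in plain OpenCL host code; it assumes the context, queue and kernel were already created, which is a story in itself:

```c
#include <CL/cl.h>

/* Minimal sketch of OpenCL's two memory spaces. Assumes ctx, queue
   and kernel were created elsewhere. */
static void run_once(cl_context ctx, cl_command_queue queue,
                     cl_kernel kernel, float *host_data, size_t n)
{
    cl_int err;
    cl_mem dev_buf = clCreateBuffer(ctx, CL_MEM_READ_WRITE,
                                    n * sizeof(float), NULL, &err);

    /* Explicit host -> device copy; the kernel cannot see host_data. */
    clEnqueueWriteBuffer(queue, dev_buf, CL_TRUE, 0, n * sizeof(float),
                         host_data, 0, NULL, NULL);

    /* The kernel receives an opaque cl_mem handle, not a pointer. */
    clSetKernelArg(kernel, 0, sizeof(cl_mem), &dev_buf);
    clEnqueueNDRangeKernel(queue, kernel, 1, NULL, &n, NULL,
                           0, NULL, NULL);

    /* Explicit device -> host copy to get the results back. */
    clEnqueueReadBuffer(queue, dev_buf, CL_TRUE, 0, n * sizeof(float),
                        host_data, 0, NULL, NULL);
    clReleaseMemObject(dev_buf);
}
```

Nothing the GPU touches is ever a normal pointer, and every byte has to be staged explicitly in both directions.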

HPC users may be happy... you may say... Well, I love all this craziness, but I know lots of HPC users (Artificial Intelligence, Bioinformatics, Computational Chemistry, etc.) who don't!!! They have enough with their science. Don't ask them to learn even about concurrency!! And don't ask them to hire developers to take care of the performance and scalability of their code (and heterogeneity? hahaha!!). They don't have an oil company's budget, and worse, they don't want to understand anything about concurrency and so on.

So... what are people doing? Well, there are some brave heterogeneous crusaders who learnt CUDA or OpenCL on top of having a PhD in some HPC-user science. These brainy monsters are an exception, and we cannot rely on them if we want the technology to be widely used. We neeeed FUSION!! We need GMAC!! We need hardware support for x86 virtual memory, and lots and lots of work that fortunately is already in the works.

For my part, I'm integrating a medical imaging OpenCL code into a visualization interface, and writing a library for my own use. It's crazy how much code a program needs if you use OpenCL from scratch. But writing a simple, generic, code-reducing OpenCL library makes things much shorter and easier. Probably I'll publish it somewhere as open source. Maybe it will be useful for others while waiting for a FUSION implementation and so on.
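As a taste of what I mean by "code-reducing", this is the kind of helper I'm talking about (a hypothetical sketch, not my actual library): one call that hides the platform/device/context/program boilerplate that plain OpenCL makes you write by hand.

```c
#include <CL/cl.h>

/* Hypothetical helper: pick the first GPU (falling back to the CPU),
   build a kernel from source, and hand everything back in one call. */
typedef struct {
    cl_context       ctx;
    cl_command_queue queue;
    cl_kernel        kernel;
} ocl_env;

static int ocl_quick_setup(const char *src, const char *kernel_name,
                           ocl_env *env)
{
    cl_platform_id plat;
    cl_device_id   dev;
    cl_int         err;

    if (clGetPlatformIDs(1, &plat, NULL) != CL_SUCCESS)
        return -1;
    /* Prefer a GPU, fall back to the CPU device. */
    if (clGetDeviceIDs(plat, CL_DEVICE_TYPE_GPU, 1, &dev, NULL) != CL_SUCCESS &&
        clGetDeviceIDs(plat, CL_DEVICE_TYPE_CPU, 1, &dev, NULL) != CL_SUCCESS)
        return -1;

    env->ctx   = clCreateContext(NULL, 1, &dev, NULL, NULL, &err);
    env->queue = clCreateCommandQueue(env->ctx, dev, 0, &err);

    cl_program prog = clCreateProgramWithSource(env->ctx, 1, &src, NULL, &err);
    if (clBuildProgram(prog, 1, &dev, NULL, NULL, NULL) != CL_SUCCESS)
        return -1;
    env->kernel = clCreateKernel(prog, kernel_name, &err);
    clReleaseProgram(prog);
    return (err == CL_SUCCESS) ? 0 : -1;
}
```

With something like this, the experimental code shrinks to the kernel source plus a couple of calls, which is the whole point.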

So that's an example of what this blog is about. I'll post whatever I have on my mind and want to share. Sometimes even business aspects of HPC, wearable HPC and so on. In the end it's all related: business drives the commodity hardware design that is so widely used in HPC.

7 comments:

  1. OpenCL is not an option. AMD engineers say that it is not meant to be used by regular programmers. Even if you are able to write a program in OpenCL that runs on some GPUs, it is very likely that it will run much slower (or not run at all) on GPUs with different characteristics (integrated vs discrete) or from a different vendor. For example, NVIDIA does not support OpenCL 1.1 yet, and Intel has just released a beta version of their OpenCL stack. Moreover, device code has to be tuned for each specific architecture. Thus, you end up maintaining different versions of all the kernels and a lot of extra host code to handle all possible hardware combinations... If you want a library that hides the complexity of the host code, currently you only have GMAC. AMD has proposed some extensions to C++ that will automatically generate OpenCL code, but they require features of the next standard (C++0x) which most compilers have not implemented yet.

    Fusion for HPC...? Let's take a look at some memory bandwidth numbers: NVIDIA Tesla C2050, 177 GB/s; AMD Fusion, 21 GB/s. Hmmmm, maybe for not-so-high-performance-computing systems :-)

  2. Wow wow, very interesting.

    First, true, raw OpenCL is not meant for writing software directly. But you can do it anyway. It depends on how much benefit you get for the pain.

    That "much slower" statement... well, I've tested it, not extensively, but with some algorithms and compared to the same code in CUDA. It had almost no performance difference. Maybe, in some cases you can not take all the performance due to some specific features not accessible through OpenCL, but that is a problem of the vendor that has not implemented an appropriate extension.

    I'm testing OpenCL on IGPs and discrete GPUs, on Linux and Mac, from AMD and NVIDIA. The only problem I've found is the amount of memory available on the GPU. You have the same problem in CUDA if you don't use the unified virtual memory in SDK 4. FUSION (which is not only hardware but also software, from what I've seen at the AMD Fusion Developer Summit) will do the same for AMD and ARM, both supporting OpenCL. So you will use the FUSION framework and write the GPU code in OpenCL C.

    I haven't had a single problem with the OpenCL 1.0 vs 1.1 difference between AMD and NVIDIA. That is logical: problems should be noticeable from 1.0 to 2.0, but from 1.0 to 1.1?? Few changes.

    Intel? Who mentioned Intel? XD That's their problem. Four GPU vendors supporting OpenCL is enough (yes: NVIDIA, AMD, ARM and Imagination Technologies; mobile also matters, since we will see it in HPC sooner rather than later). Also, the AMD OpenCL implementation supports both AMD and Intel CPUs with an LLVM-based compiler that automatically tunes at a low level for either of them. Even more, you can run the GPU kernels on the CPU, and Intel has nothing better than CPUs for now.

    Device code CAN be tuned for each specific architecture. But it is not mandatory; it depends on your needs. The key point here is that you can keep a generic device code that RUNS on any GPU, and faster than on any CPU (if you compare similar top-end CPUs and GPUs; don't compare an IGP to a 12-core Opteron!! It's unfair). Then, from this generic code, you can tune for a specific architecture, but you don't have to rewrite all the code or translate a single line of it. You can focus just on the device code. That's the idea. And I think it's great!!

    Yes!! I love GMAC!! Something like that is what they have announced with FUSION, if I have understood it right, and later in hardware, with x86_64 virtual memory support in their Next Generation GPU Architecture, so we won't even need GMAC.

    When I say FUSION I'm talking about that whole framework for easy coding, GMAC-style memory management and so on, not about a low-power APU. But anyway, does low power matter in HPC?? I think the answer is YES. So maybe not a Zacate APU, but some sort of Project Denver from NVIDIA? Even a high-end Bulldozer APU?? Or simply a non-NVIDIA, Denver-style ARM Cortex-A15 with a supercharged PowerVR SGX 600MP version. Even Mali could be useful some day. All of that will work with OpenCL. OpenCL is a given, because it's a standard. And FUSION is another standard on top of it, embraced by AMD and ARM. That's a good enough roadmap to stay with OpenCL, in my opinion.

  3. I think you have misread some of my statements. I don't say CUDA code is faster than OpenCL code (in fact, both CUDA and OpenCL are translated to the same PTX code). I mean that writing code for an NVIDIA GPU is a completely different story from doing it for AMD GPUs (a VLIW architecture). In the end you will see software companies shipping many kernel versions and performing run-time device type/vendor detection to select the most appropriate kernel. However, the biggest issue is that, if the device is integrated, the HOST CODE is completely different from the code needed for discrete GPUs (mapping vs allocating + transferring). Even within the discrete GPU scenario, NVIDIA supports up to 8 concurrent kernels running on the GPU, and concurrent execution/data transfer (full duplex with PCIe 2.0). AMD does not support asynchronous data transfers yet. The unified global address space is not going to help with this if these issues are not hidden by the programming model (neither CUDA nor OpenCL hides them), like DSMs do.

    You are being very optimistic about OpenCL. First, it is simply not true that you can run the same kernel on all GPUs (this is not Java): the work-group size, the local memory size and the presence of constant memory differ across vendors, and you have to code your kernels with these characteristics in mind. Second, the OpenCL standard is flawed in many aspects; it tries to be low-level but adds some high-level abstractions. Per-context allocation instead of per-device allocation? Come on... Out-of-order command queues? With what scheduling policy? cl_mem instead of pointers!! WTF?? What were they smoking? You either have to create sub-buffers to pass an offset view of a data structure to a kernel, or add an additional argument with the offset. Are we in the '60s? Furthermore, the standard leaves some key implementation details to vendor-dependent implementations. The way to perform "fast" copies between host and GPU memory is completely different on NVIDIA and AMD GPUs; and I am sure that Intel will figure out a third incompatible way to do them...
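    To make the cl_mem complaint concrete, here is a quick sketch of the two workarounds (clCreateSubBuffer is the real OpenCL 1.1 entry point; the helper names are made up):

    ```c
    #include <CL/cl.h>

    /* Workaround 1 (OpenCL 1.1): carve a sub-buffer out of `buf` starting
       at byte `offset`. The origin must respect the device's
       CL_DEVICE_MEM_BASE_ADDR_ALIGN alignment: yet another portability trap. */
    static cl_mem make_offset_view(cl_mem buf, size_t offset, size_t size)
    {
        cl_buffer_region region = { offset, size };
        cl_int err;
        return clCreateSubBuffer(buf, CL_MEM_READ_WRITE,
                                 CL_BUFFER_CREATE_TYPE_REGION,
                                 &region, &err);
    }

    /* Workaround 2: pass the whole buffer plus an explicit element offset,
       and let the kernel index as data[offset + get_global_id(0)]. */
    static void set_args_with_offset(cl_kernel k, cl_mem buf, cl_uint offset)
    {
        clSetKernelArg(k, 0, sizeof(cl_mem), &buf);
        clSetKernelArg(k, 1, sizeof(cl_uint), &offset);
    }
    ```

    With real pointers, `buf + offset` would have been one expression.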

    Finally, if you don't have good support from most of the vendors, nobody is going to use it. And right now, writing OpenCL programs on any platform is a pain in the ass. Regarding the changes between 1.0 and 1.1 (callbacks, sub-buffers...), you should see the amount of ifdefs in the GMAC code...

    PS: I don't think we will see embedded computing in HPC until they solve the problem I mentioned in my first comment: 177 GB/s vs 21 GB/s memory bandwidth. And right now, computations are memory-bound. You cannot achieve those rates with less than 200 W per card.

  4. OK, yes. Writing CUDA code is faster than writing OpenCL code, even more so with the GMAC-like Unified Virtual Memory in SDK 4. That is, on the host code side.

    After reading a bit more about the AMD Fusion Developer Summit presentations, the thing is:

    AMD wants to solve all those OpenCL problems by offering an open architecture specification (Fusion System Architecture) to get any vendor's CPU and GPU working together in a more integrated manner: unified and coherent memory spaces, access to CPU interrupts from the GPU, recursion on the GPU, more sophisticated OS scheduling through queues, etc.

    So OK, that is not today; that is just a roadmap. If you need to build a product based on GPGPU today, CUDA is your choice, absolutely.

    On the device side, yes, VLIW on AMD GPUs... But again, if you look at the AMD roadmap, they are removing VLIW in their "Next Generation Architecture".

    So, if you are just developing software for research purposes, to be able to test your algorithms with several big data sets (so you need performance), and you want your code to stay compatible with whatever hardware comes in the future, to try the different hardware available, or even to consider running your code on mobile devices with PowerVR SGX graphics and so on, then you write a small library with the most common functions in OpenCL and write your experimental code against your own code-reducing library. Well, that was my choice XD

    Yes, soon I'll be dealing with the APU vs discrete-GPU host-code thing... I'll tell you what I did once I face it XD.

    You CAN run exactly the same kernel on different devices, even more so if you use the automatic work-group-size option in OpenCL. I did it, and it works. The thing is that if you tune too much for a specific device, some optimizations can reduce performance on other devices, and if you use hardware- or vendor-specific extensions, you are limiting your hardware choices. You CAN write device code that runs on any GPU and even on the CPU. I did it. If I ever need to run this code on a big machine, for several experiments, and I need extra performance, then I'll adapt my current generic version to the specific hardware with whatever tuning/extensions help. Until then, the generic version is still useful to me.

    In fact, some float4 variables are giving me more performance even on NVIDIA and on the CPU. I guess it's a memory-transfer thing on NVIDIA and the vector units on the CPU. So I use float4 for the AMD GPU/CPU, the NVIDIA GPU and the Intel CPU in some parts of the device code, using exactly the same code. The best thing to do would be to at least have a kernel without local memory for the CPU, but the funny thing is that even against an OpenMP C version (no vector types), the OpenCL version (with float4 in a critical region where float4 was very easy to use) runs a few seconds faster, like from 18 s to 15 s, on the same CPU. Of course, on the AMD GPU the performance increase is bigger than on the NVIDIA GPU. So the native vector types in OpenCL can be helpful, and sometimes easy.
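    Just to illustrate the float4 thing, a toy sketch (not my actual imaging kernel): the same OpenCL C source runs everywhere, and the float4 flavor is what tends to help on AMD's VLIW lanes and on CPU vector units.

    ```c
    /* OpenCL C device code: a toy saxpy, scalar vs float4 flavors. */
    __kernel void saxpy_scalar(__global const float *x,
                               __global float *y, float a)
    {
        size_t i = get_global_id(0);
        y[i] = a * x[i] + y[i];
    }

    __kernel void saxpy_float4(__global const float4 *x,
                               __global float4 *y, float a)
    {
        size_t i = get_global_id(0);   /* one work-item handles 4 floats */
        y[i] = a * x[i] + y[i];
    }
    ```

    And for the automatic work-group size I mentioned: you just pass NULL as the local work size to clEnqueueNDRangeKernel and let the implementation pick one.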

    ResponderEliminar
  5. OK, OpenCL is not perfect and has some problems to overcome. If you don't make money with it, and you are optimistic like me, you can suffer a bit and write different host-code versions (I still haven't faced that problem, but probably will) while waiting for these things to be unified, as the OpenCL portability policy intends. So I expect these things to be solved in future OpenCL versions. Maybe AMD's FSA will even help. But you can also try to solve these problems in software. Of course, indexing a cl_mem object is a mess; I can't even imagine a way to make something like buffer[my_index] possible with a cl_mem object... Maybe by modifying the compiler... come on... XD

    OK, it's clear that CPU-GPU memory transfers and things like that are BAD in OpenCL and will change, so if you write serious (I mean tuned) portable code you will suffer a lot, from difficulties due to missing features to writing different code versions. It depends on what you need now.

    To sum up: since I don't know where my code will run if it is ever used in production some day, and that won't be tomorrow or even next year, I prefer to use OpenCL and be able to test and run it on any hardware.

    OK, memory bandwidth is a problem. But what happens if you can add more nodes thanks to the reduced power consumption? You are adding more interconnect overhead. So maybe a supercharged Bulldozer APU would make sense. We'll see what the vendors do, but what is sure is that I won't be able to test any of that with CUDA.

  6. The Unified Virtual Memory in CUDA 4 is not similar to GMAC. It only ensures that device addresses are unique in the system (and thus device-to-device copies are now allowed), but the programmer still needs to perform an extra allocation on the host and synchronize the contents by hand to access the data.
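    To be concrete (a minimal sketch, using the standard CUDA 4 runtime calls): cudaMemcpyDefault lets the runtime infer the copy direction from the pointers, but the double allocation and the hand-written copies are still there, which is exactly what GMAC removes.

    ```c
    #include <cuda_runtime.h>
    #include <stdlib.h>

    int main(void)
    {
        size_t bytes = 1024 * sizeof(float);
        float *h = (float *)malloc(bytes);   /* host allocation   */
        float *d;
        cudaMalloc((void **)&d, bytes);      /* device allocation */

        /* With UVA the direction is inferred from the pointers... */
        cudaMemcpy(d, h, bytes, cudaMemcpyDefault);
        /* ... launch kernels on d ... */
        cudaMemcpy(h, d, bytes, cudaMemcpyDefault);

        cudaFree(d);
        free(h);
        return 0;
    }
    ```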

    The problems of OpenCL are not related to AMD hardware, but to the specification itself. It is not that OpenCL is not perfect; it is that it is broken by design. Adding support for a unified address space is not going to help if cl_mems are used in most of the functions of the API. I don't see how AMD's FSA is going to help here. And how are memory transfers going to change? They need to keep backward compatibility, so these functions (and the model behind them) will stay. The only solution is to make OpenCL 2.0 incompatible with the 1.x versions and provide a brand-new API. Not very likely...

    OK, you may be able to code a kernel that runs on all platforms (if you forget about double-precision floating point and 64-bit atomics, which are optional). But trying to be too generic can make you lose a lot of performance (several x's). Hence, software developers will have to write many versions of the code. And they will have to do it for the host code, too.
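    For example (a sketch of what "optional" means in practice): double precision is an extension in OpenCL 1.x, so a portable kernel has to guard it, and the host code has to query CL_DEVICE_EXTENSIONS and pass arguments of the matching type.

    ```c
    /* Device side: doubles only exist if the device exposes cl_khr_fp64. */
    #ifdef cl_khr_fp64
    #pragma OPENCL EXTENSION cl_khr_fp64 : enable
    typedef double real;
    #else
    typedef float real;   /* silent precision downgrade on other devices */
    #endif

    /* The host must know which branch was taken to set this argument. */
    __kernel void scale(__global real *data, real factor)
    {
        data[get_global_id(0)] *= factor;
    }
    ```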

    By the way, the performance improvement you are seeing on NVIDIA is due to reduced code redundancy (your code executes fewer dynamic instructions), not due to memory bandwidth. You reach the maximum memory bandwidth either with 16 coalesced "float" loads or with 4 coalesced "float4" loads. The AMD version benefits greatly because you ARE SUPPOSED to write code like that for their architecture. The comparison you are making with OpenMP is not fair; you have to compare against vectorized code (did you try the Intel compiler?).

    A clarification about interrupts: they are going to handle GPU interrupts, not CPU interrupts (and this is only useful in some special cases, because it kills kernel performance). And I am pretty sure that having coherent memory accesses over a PCIe bus kills performance too.

    You can see how painful writing the host code of a real application can be in this presentation: http://code.google.com/p/adsm/downloads/detail?name=RTM-GMAC-GTC2010_final.pptx. And now try to do the same with OpenCL; it is even worse than with CUDA. Note that AMD's OpenCL does not support asynchronous data transfers yet, and you have to emulate them by adding extra threads. But wait, NVIDIA does support asynchronous data transfers...

    Finally, a CPU (or an APU, because it uses the same physical memory) will never have the same memory throughput as a GPU, because their needs are different. CPUs need low-latency accesses, while GPUs don't care about latency because they can hide it by scheduling other threads. Look at the memory bandwidth numbers of high-end processors: nothing close to a discrete GPU.

  7. Bad for CUDA 4 XD Good for GMAC.

    Well, maybe OpenCL 2.x will do the job, maybe not... I don't care, I'm not writing production code XD If I had to write production code tomorrow, I would switch to CUDA + GMAC right away. But I don't know how things will look in a few years.

    Developers will have to write different kernels if they need extra performance. Maybe the performance of a generic code version is enough for their software. It is always a matter of balancing costs and benefits. With OpenCL you can weigh that choice; with CUDA you can't, because all the hardware will be from NVIDIA (well, you can pay for the PGI x86 CUDA compiler). Yes, easier to program, but muuuuch more restrictive too.

    Thank you!! I knew about the 16 single-float transfers, but... I didn't know how to explain the performance increase.
    Yes, I know I'm supposed to write the code like that for AMD, but I could decide that I have no time to invest in adapting the code if it only benefits AMD cards. Maybe having the code run faster on the AMD GPU than on the CPU is enough for my purposes. Again, it depends on the project's needs. And from what I understand from the AMD slides, they are eliminating VLIW from their GPUs in the Next Generation Architecture, so in the future you will no longer be expected to use vector types for AMD GPUs.

    Well, when I finish my master's courses, maybe I'll understand better the utility of being able to handle GPU interrupts. For now, I can guess it's a matter of controlling scheduling priorities, to avoid annoying blue circles or spinning color balls. So, software responsiveness instead of overall execution time. I guess this is not an HPC-friendly feature, but a UI- and/or system-friendly one.
    Who knows... if they do a good cache-coherency implementation, taking into account the PCIe characteristics, maybe it won't lose too much performance. I don't know, but with OpenCL I'll get to see it. Not with CUDA.

    I can imagine that implementing GMAC is not easy. It is very useful, but I guess it needs low-, low-, low-level programming and deep operating-systems knowledge. If you add the cl_mem alien to the equation... I don't even want to imagine it XD And having to take into account a greater variety of hardware with different capabilities... well yes, it's lskjfgslkjfg-full, but it's also POSSIBLE. That's the point of OpenCL 1.0 or 1.1, and I hope the next versions make it easier.

    Well, even the way a GPU accesses GDDR is different from the way a CPU accesses DDR. Maybe some vendor decides to put GDDR slots on a motherboard so that the GPU part of an APU can connect to them and get a performance benefit, or maybe someone comes up with a different type of memory suitable for GPU and CPU at the same time, or a kind of cache that solves the problem, or a combination of both. Or maybe we just won't ever see the same memory throughput on APUs as on discrete GPUs XD
