Channel: CUDAfy.NET

New Post: Roslyn + NVRTC

My understanding of NSight with NVVM is that the debugger just works, provided you emit the proper DWARF annotations in the generated NVVM IR and point it at the right source file. That means it should be possible to step through the C# code directly, with no C intermediate. QuantAlea can debug F# running on the GPU this way, for instance.
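For the NVRTC side of the thread title, the analogous step is asking the runtime compiler for debug or line info when compiling the generated source, so the tools can map back to the original file. Here's a minimal sketch in plain Cuda host code (not CUDAfy.NET itself); the axpy kernel string and file name are made up for illustration, and the option strings are the documented NVRTC ones:

// Minimal NVRTC sketch: compile a kernel string at runtime and request
// source-line mapping so the debugger/profiler can step against the
// original file. --device-debug (-G) gives full stepping but disables
// optimization; --generate-line-info (-lineinfo) keeps optimization and
// still maps the generated code back to source lines.
#include <nvrtc.h>
#include <cstdio>
#include <vector>

int main() {
    const char* src =
        "extern \"C\" __global__ void axpy(float a, float* x, float* y, int n)\n"
        "{\n"
        "    int i = blockIdx.x * blockDim.x + threadIdx.x;\n"
        "    if (i < n) y[i] = a * x[i] + y[i];\n"
        "}\n";

    nvrtcProgram prog;
    nvrtcCreateProgram(&prog, src, "axpy.cu", 0, nullptr, nullptr);

    const char* opts[] = { "--gpu-architecture=compute_30",
                           "--generate-line-info" };
    nvrtcResult rc = nvrtcCompileProgram(prog, 2, opts);

    size_t logSize = 0;
    nvrtcGetProgramLogSize(prog, &logSize);
    std::vector<char> log(logSize);
    nvrtcGetProgramLog(prog, log.data());
    if (rc != NVRTC_SUCCESS) { printf("%s\n", log.data()); return 1; }

    size_t ptxSize = 0;
    nvrtcGetPTXSize(prog, &ptxSize);
    std::vector<char> ptx(ptxSize);
    nvrtcGetPTX(prog, ptx.data());   // load this PTX via the driver API
    nvrtcDestroyProgram(&prog);
    return 0;
}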

As far as optimizations go, C# should optimize much like C unless something goes very wrong, since the languages are so similar. If anything, the decompilation step as it's currently implemented is what could break this, if it hits a case where it can't cleanly regenerate some bit of code. NSight looks like it has awesome profiling tools, by the way, at least as of Cuda 7.5.

Memory management on the GPU is a very interesting topic - OpenCL doesn't support it, but Cuda has for some time, since Fermi I believe. Basically, there are a number of problems that are highly parallel and have large problem sizes, but need dynamic memory allocation to work well. An example is constructing a BVH for ray tracing: it's a hefty amount of work (maybe you have 100 million triangles in the scene...), and it may have to be entirely redone every frame, depending on how many things are moving in the scene. And if you want anything close to real time, you'd better not have to transfer 3 GB of BVH to the GPU each time it's rebuilt! The thing is, building a BVH is essentially building a tree, so you either need malloc or you have to do it yourself in some flat buffer whose size you guessed before you started.
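To make that concrete, here's a minimal sketch of device-side allocation in plain Cuda (compute capability 2.0+, i.e. Fermi or later). The Node struct and kernel are made up for illustration and error handling is trimmed; the point is just that threads can malloc from the device heap without the host sizing a buffer up front:

#include <cstdio>

struct Node { int key; Node* left; Node* right; };

__global__ void buildNodes(const int* keys, Node** out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    // Allocates from the device heap, not host memory; returns NULL if
    // the heap is exhausted.
    Node* node = (Node*)malloc(sizeof(Node));
    if (node) {
        node->key  = keys[i];
        node->left = node->right = nullptr;
    }
    out[i] = node;  // stays valid across kernel launches until free()d
}

int main() {
    // The device heap defaults to just 8 MB; size it for the worst case
    // before the first launch.
    cudaDeviceSetLimit(cudaLimitMallocHeapSize, 256u << 20);

    int n = 1 << 20;
    int* keys;    cudaMalloc(&keys,  n * sizeof(int));
    Node** nodes; cudaMalloc(&nodes, n * sizeof(Node*));
    cudaMemset(keys, 0, n * sizeof(int));  // stand-in for real input data

    buildNodes<<<(n + 255) / 256, 256>>>(keys, nodes, n);
    cudaDeviceSynchronize();
    return 0;
}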

As for a full GC, I'll agree that at this point it's not needed or even wanted. However, I suspect that more and more of a program will eventually end up on the GPU as libraries begin to show up. There are some good reasons to move things over - latency between kernel launches is a big one, if you need small kernels with synchronization between them. The other big one is avoiding transfers between host and device any more than absolutely necessary. It's often a win to run serial code on the GPU if the alternative is a data transfer or an execution bubble; GPUs aren't that much slower than CPUs on serial code - within an order of magnitude, I'd suspect. Who knows whether a full managed runtime will show up in GPU land in the next decade or two?
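As a sketch of that trade-off (plain Cuda again, kernels made up for illustration): a serial reduction runs as a one-thread kernel so the intermediate sum never crosses the PCIe bus, and the follow-up kernel reads it straight from device memory:

__global__ void sumAll(const float* x, float* total, int n) {
    // Deliberately serial: one thread walks the array. Slow per element,
    // but possibly cheaper than a device-to-host copy, a host-side sum,
    // and a copy back - and it leaves no execution bubble between kernels.
    float s = 0.0f;
    for (int i = 0; i < n; ++i) s += x[i];
    *total = s;
}

__global__ void normalize(float* x, const float* total, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] /= *total;  // reads the sum straight from device memory
}

int main() {
    int n = 1 << 20;
    float *x, *total;
    cudaMalloc(&x,     n * sizeof(float));
    cudaMalloc(&total, sizeof(float));
    // x would be filled by earlier kernels in a real program.

    sumAll<<<1, 1>>>(x, total, n);                 // serial step stays on GPU
    normalize<<<(n + 255) / 256, 256>>>(x, total, n);
    cudaDeviceSynchronize();
    return 0;
}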
