Yes, I've been using dynamic parallelism successfully, but as far as I can tell the current support in Cudafy does not let you specify a stream for the child kernel.
The CUDA Programming Guide appears to suggest that you can create streams in device code and then launch child kernels asynchronously on those streams. I'm fairly certain that would give a good speedup (at least in my case).
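For reference, here is a minimal sketch of that pattern in plain CUDA C++ (not Cudafy; how Cudafy would expose it is exactly the open question). The kernel names, chunk sizes, and data layout are made up for illustration. It assumes a device of compute capability 3.5 or higher and a build with relocatable device code, e.g. `nvcc -rdc=true example.cu -lcudadevrt` plus a suitable `-arch`.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// Hypothetical child kernel: doubles every element of one chunk.
__global__ void childKernel(float *data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= 2.0f;
}

// Hypothetical parent kernel: each thread launches a child on its own chunk,
// in its own device-side stream, so the children may run concurrently.
__global__ void parentKernel(float *data, int chunk, int numChunks)
{
    int c = blockIdx.x * blockDim.x + threadIdx.x;
    if (c >= numChunks) return;

    // Streams created in device code must use the cudaStreamNonBlocking flag.
    // If creation fails, fall back to the default (NULL) stream.
    cudaStream_t s = 0;
    bool created =
        (cudaStreamCreateWithFlags(&s, cudaStreamNonBlocking) == cudaSuccess);
    if (!created) s = 0;

    childKernel<<<(chunk + 127) / 128, 128, 0, s>>>(data + c * chunk, chunk);

    // Destroying the stream does not wait; its resources are released
    // once the work queued in it has finished.
    if (created) cudaStreamDestroy(s);
}

int main()
{
    const int chunk = 1024, numChunks = 8, n = chunk * numChunks;
    std::vector<float> host(n, 1.0f);

    float *dev = nullptr;
    cudaMalloc(&dev, n * sizeof(float));
    cudaMemcpy(dev, host.data(), n * sizeof(float), cudaMemcpyHostToDevice);

    parentKernel<<<1, numChunks>>>(dev, chunk, numChunks);

    // The parent grid is not complete until all of its children are,
    // so one host-side synchronize covers every child launch.
    cudaError_t err = cudaDeviceSynchronize();
    if (err != cudaSuccess) {
        std::printf("CUDA error: %s\n", cudaGetErrorString(err));
        return 1;
    }

    cudaMemcpy(host.data(), dev, n * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(dev);

    // Every element started at 1.0f and should have been doubled once.
    bool ok = true;
    for (float v : host) ok = ok && (v == 2.0f);
    std::printf(ok ? "ok\n" : "mismatch\n");
    return ok ? 0 : 1;
}
```

Without the explicit stream, child launches from threads in the same block share that block's default stream and are serialized, which is why per-thread streams might help. Whether they actually do will depend on the workload.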
I'd be willing to help with implementing this. Do you guys have any high-level description of the Cudafy code architecture that would speed up my implementation? The code base is quite large, so I could do with some pointers!