Hi
Is nbRows = 6784?
Anyhow, your problem is here:
gpu.Launch(nbRows, 1, ...
The first argument is the number of blocks and the second is the number of threads per block, so you are launching 6784 blocks of 1 thread each.
Btw, if you were to do the opposite (gpu.Launch(1, nbRows, ...), it would work in theory, but in practice you won't be allowed more than 1024 threads per block (the exact limit is device-dependent). Besides, the ideal number of threads per block is something that needs fine-tuning. Finally, using only 1 block means using only a small fraction of your GPU's capacity.
You should combine threads with blocks: launch several blocks of many threads each, and within your kernel compute the overall index of your row from the thread and block IDs. There are many such cases in the Cudafy examples.
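To make the index arithmetic concrete, here is a rough CPU-only sketch in plain C++ (not Cudafy code, so treat it as an illustration rather than something to paste into your kernel). The names blocks_needed, kernel_body and emulate_launch are made up for this example, and 256 threads per block is only an assumed starting point that you would still need to tune. In a real kernel the block index, block size and thread index would come from the thread/block built-ins rather than from function parameters.

```cpp
#include <cassert>
#include <vector>

// Number of blocks needed so that blocks * threadsPerBlock >= nbRows
// (integer ceiling division).
int blocks_needed(int nbRows, int threadsPerBlock) {
    assert(nbRows >= 0 && threadsPerBlock > 0);
    return (nbRows + threadsPerBlock - 1) / threadsPerBlock;
}

// Hypothetical stand-in for the kernel body. On the GPU the three index
// arguments would be supplied by the runtime, not passed as parameters.
void kernel_body(int blockIdx, int blockDim, int threadIdx,
                 int nbRows, std::vector<int>& hits) {
    int row = blockIdx * blockDim + threadIdx;  // overall row index
    if (row < nbRows) {   // the last block may have spare threads
        hits[row] += 1;   // placeholder for the real per-row work
    }
}

// CPU emulation of a launch: run the body once per (block, thread) pair
// and return how many times each row was processed.
std::vector<int> emulate_launch(int nbRows, int threadsPerBlock) {
    std::vector<int> hits(nbRows, 0);
    int blocks = blocks_needed(nbRows, threadsPerBlock);
    for (int b = 0; b < blocks; ++b) {
        for (int t = 0; t < threadsPerBlock; ++t) {
            kernel_body(b, threadsPerBlock, t, nbRows, hits);
        }
    }
    return hits;
}
```

With nbRows = 6784 and 256 threads per block, blocks_needed returns 27: 26 full blocks cover 6656 rows and the 27th handles the remaining 128, with its other 128 threads skipped by the bounds check. Every entry returned by emulate_launch(6784, 256) is 1, i.e. each row is processed exactly once.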