Offloading every damn computation to the GPU doesn't make your code "parallelized"; it just makes it slow and a pain to debug. How about optimizing the CPU code first instead of leaning on a 1000-page CUDA manual?