split in API specific classes for Cuda/OpenCL
this to
1) it's cleaner
2) it avoids pulling in openCL code in cuda classes which leads
to clashes between nvidia headers and opencl.hpp
there is still too much API specific things in interface between the
bda components to work through a virtual interface so we still have to cast
to the relevant implementation in various places.