...when it is possible to utilize more than 1500 processors on the latest NVIDIA cards for processing, instead of just 4 or 8 cores on the main CPU.
Just to level expectations - just because modern GPUs have 1500+ cores doesn't mean that you'll gain a 375x increase in performance over a quad core machine by utilising them.
Sure, I was not saying that using 1500 additional CUDA cores instead of 4 main CPU cores would give a 375x increase in performance. A single CUDA core and a single CPU core have different processing power, different resources available to them, and different target applications.
What I meant is that using the full CUDA processing resources, with all available CUDA cores, could provide a drastic improvement in RAW image processing performance. It could be especially useful for image denoising, which can be split into a huge number of parallel tasks, with a separate task for each small image block.
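To make the per-block idea concrete, here is a minimal CUDA sketch of tile-parallel image processing. A simple box filter stands in for the real denoiser (PRIME's algorithm is not public), and the kernel and function names are purely illustrative; the point is only that each thread block handles one small image tile independently, so thousands of tiles can be processed at once.

```cpp
// Minimal sketch of tile-parallel image processing in CUDA.
// A box filter stands in for a real denoiser; one thread per pixel,
// one 16x16 thread block per image tile.
#include <cuda_runtime.h>

__global__ void denoiseTileKernel(const float* in, float* out,
                                  int width, int height, int radius)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x >= width || y >= height) return;

    float sum = 0.0f;
    int count = 0;
    for (int dy = -radius; dy <= radius; ++dy) {
        for (int dx = -radius; dx <= radius; ++dx) {
            int sx = min(max(x + dx, 0), width - 1);  // clamp to image border
            int sy = min(max(y + dy, 0), height - 1);
            sum += in[sy * width + sx];
            ++count;
        }
    }
    out[y * width + x] = sum / count;
}

void denoiseImage(const float* d_in, float* d_out, int width, int height)
{
    dim3 block(16, 16);                      // one tile per thread block
    dim3 grid((width + block.x - 1) / block.x,
              (height + block.y - 1) / block.y);
    denoiseTileKernel<<<grid, block>>>(d_in, d_out, width, height, 2);
    cudaDeviceSynchronize();
}
```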
On my laptop, DXO PRIME denoising takes about 150 seconds for A7R, 80 seconds for 1DX and ~23 seconds for a7S RAW files. The process loads all 4 CPU cores (8 threads) up to 100%, boosts the CPU clock to 3.4 GHz and pushes the CPU temperature to 100C. Using the 1536 cores of an NVIDIA GTX 780M could provide a drastic performance improvement and reduce the load on the main CPU. Even a 10x speedup would mean 15 seconds for the A7R, 8 seconds for the 1DX and 2 seconds for the A7S with PRIME denoising, and I believe it could be even better when fully utilizing all CUDA cores.
When I started this topic I provided several links to very interesting presentations on the subject: one is an NVIDIA presentation and the other describes a very impressive real-time embedded image processing implementation. Those papers show what level of performance improvement can be achieved using CUDA technology for image and video processing directly from RAW files.
Benchmark results for Fastvideo's real-time processing implementation for industrial cameras are really amazing:
http://www.fastcompression.com
http://on-demand.gputechconf.com/gtc/2014/presentations/S4728-gpu-image-processing-camera-apps.pdf
Final Benchmark on GPU (Titan)
CMOSIS image sensor CMV20000, 5120x3840 (~20mpx), 12-bit, 30 fps
GeForce GTX Titan GPU
Host to device transfer ~1.5 ms
Demosaic ~3.1 ms
JPEG encoding (90%, 4:2:0) ~7.8 ms
Device to Host transfer ~1.3 ms
Total: ~13.7 ms
P.S. This is the benchmark for the PCIe camera CB-200 from XIMEA
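For illustration, a stage breakdown like the one above can be measured with CUDA events. This is only a sketch under my own assumptions: demosaicKernel and encodeJpeg are placeholders, not Fastvideo's actual (closed) implementation.

```cpp
// Sketch of timing a host->device / demosaic / encode / device->host
// pipeline with CUDA events. The processing kernels are placeholders.
#include <cuda_runtime.h>
#include <cstdio>

static float stageMs(cudaEvent_t start, cudaEvent_t stop)
{
    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    return ms;
}

void runPipeline(const unsigned short* h_raw, unsigned char* h_jpeg,
                 size_t rawBytes, size_t jpegBytes,
                 unsigned short* d_raw, unsigned char* d_rgb, unsigned char* d_jpeg)
{
    cudaEvent_t e[5];
    for (int i = 0; i < 5; ++i) cudaEventCreate(&e[i]);

    cudaEventRecord(e[0]);
    cudaMemcpy(d_raw, h_raw, rawBytes, cudaMemcpyHostToDevice);    // host -> device
    cudaEventRecord(e[1]);
    // demosaicKernel<<<grid, block>>>(d_raw, d_rgb, ...);         // placeholder
    cudaEventRecord(e[2]);
    // encodeJpeg(d_rgb, d_jpeg, ...);                             // placeholder
    cudaEventRecord(e[3]);
    cudaMemcpy(h_jpeg, d_jpeg, jpegBytes, cudaMemcpyDeviceToHost); // device -> host
    cudaEventRecord(e[4]);
    cudaEventSynchronize(e[4]);

    printf("H2D %.2f ms, demosaic %.2f ms, encode %.2f ms, D2H %.2f ms\n",
           stageMs(e[0], e[1]), stageMs(e[1], e[2]),
           stageMs(e[2], e[3]), stageMs(e[3], e[4]));

    for (int i = 0; i < 5; ++i) cudaEventDestroy(e[i]);
}
```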
Solution for Photo Hosting
Task description: load-decode-resize-encode-store
Image load ~1.5 ms for 2048x2048 jpg image
JPEG decoding ~3.4 ms
Downsize to 1024x1024 with bicubic algorithm ~0.7 ms
JPEG encoding (quality 90%, 4:4:4) ~3.4 ms
Image store ~1.0 ms
GPU processing time ~7.5 ms
Total time ~10 ms
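The bicubic downsize step is the kind of operation that maps very naturally to one CUDA thread per output pixel. Below is a minimal single-channel sketch using the standard Keys cubic convolution weights; it is my own illustration, not the code behind the numbers above.

```cpp
// Minimal sketch of a bicubic downsize kernel (single-channel float).
#include <cuda_runtime.h>

__device__ float cubicWeight(float t)
{
    // Keys cubic convolution kernel, a = -0.5
    const float a = -0.5f;
    t = fabsf(t);
    if (t <= 1.0f) return (a + 2.0f) * t * t * t - (a + 3.0f) * t * t + 1.0f;
    if (t <  2.0f) return a * t * t * t - 5.0f * a * t * t + 8.0f * a * t - 4.0f * a;
    return 0.0f;
}

__global__ void bicubicDownsize(const float* src, int srcW, int srcH,
                                float* dst, int dstW, int dstH)
{
    int dx = blockIdx.x * blockDim.x + threadIdx.x;
    int dy = blockIdx.y * blockDim.y + threadIdx.y;
    if (dx >= dstW || dy >= dstH) return;

    // Map the destination pixel back to source coordinates
    float sx = (dx + 0.5f) * srcW / dstW - 0.5f;
    float sy = (dy + 0.5f) * srcH / dstH - 0.5f;
    int ix = floorf(sx);
    int iy = floorf(sy);

    // Weighted sum over the 4x4 source neighborhood
    float sum = 0.0f, wsum = 0.0f;
    for (int m = -1; m <= 2; ++m) {
        for (int n = -1; n <= 2; ++n) {
            int px = min(max(ix + n, 0), srcW - 1);
            int py = min(max(iy + m, 0), srcH - 1);
            float w = cubicWeight(sx - (ix + n)) * cubicWeight(sy - (iy + m));
            sum  += w * src[py * srcW + px];
            wsum += w;
        }
    }
    dst[dy * dstW + dx] = sum / wsum;
}
```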
I would like to see that level of performance in the products I currently use, especially in some new LR release.
And there is more info in this NVIDIA presentation:
http://on-demand.gputechconf.com/siggraph/2013/presentation/SG3108-GPU-Programming-Video-Image-Processing.pdf