When I simulate a trivial example, the FFT plan creation and pre-processing stage takes 126.52 seconds, while the simulation itself takes less than a second. That can't be right. The same simulation runs in less than a second in total in MATLAB.
I get the following output:
+---------------------------------------------------------------+
| kspaceFirstOrder-CUDA v1.3 |
+---------------------------------------------------------------+
| Reading simulation configuration: Done |
| Selected GPU device id: 0 |
| GPU device name: NVIDIA GeForce RTX 3090 |
| Number of CPU threads: 1 |
| Processor name: Intel(R) Core(TM) i7-10700K CPU @ 3.80GHz |
+---------------------------------------------------------------+
| Simulation details |
+---------------------------------------------------------------+
| Domain dimensions: 128 x 108 |
| Medium type: 2D |
| Simulation time steps: 793 |
+---------------------------------------------------------------+
| Initialization |
+---------------------------------------------------------------+
| Memory allocation: Done |
| Data loading: Done |
| Elapsed time: 0.01s |
+---------------------------------------------------------------+
| FFT plans creation: Done |
| Pre-processing phase: Done |
| Elapsed time: 126.52s |
+---------------------------------------------------------------+
| Computational resources |
+---------------------------------------------------------------+
| Current host memory in use: 408MB |
| Current device memory in use: 1300MB |
| Expected output file size: 0MB |
+---------------------------------------------------------------+
| Simulation |
+----------+----------------+--------------+--------------------+
| Progress | Elapsed time | Time to go | Est. finish time |
+----------+----------------+--------------+--------------------+
| 0% | 0.001s | 0.396s | 05/07/22 16:52:22 |
| 5% | 0.014s | 0.257s | 05/07/22 16:52:22 |
| 10% | 0.029s | 0.255s | 05/07/22 16:52:22 |
| 15% | 0.044s | 0.247s | 05/07/22 16:52:22 |
| 20% | 0.058s | 0.229s | 05/07/22 16:52:22 |
| 25% | 0.071s | 0.211s | 05/07/22 16:52:22 |
| 30% | 0.086s | 0.199s | 05/07/22 16:52:22 |
| 35% | 0.100s | 0.184s | 05/07/22 16:52:22 |
| 40% | 0.114s | 0.169s | 05/07/22 16:52:22 |
| 45% | 0.130s | 0.158s | 05/07/22 16:52:22 |
| 50% | 0.143s | 0.142s | 05/07/22 16:52:22 |
| 55% | 0.159s | 0.129s | 05/07/22 16:52:22 |
| 60% | 0.173s | 0.115s | 05/07/22 16:52:22 |
| 65% | 0.185s | 0.099s | 05/07/22 16:52:22 |
| 70% | 0.203s | 0.086s | 05/07/22 16:52:22 |
| 75% | 0.217s | 0.072s | 05/07/22 16:52:22 |
| 80% | 0.231s | 0.057s | 05/07/22 16:52:22 |
| 85% | 0.245s | 0.042s | 05/07/22 16:52:22 |
| 90% | 0.259s | 0.028s | 05/07/22 16:52:22 |
| 95% | 0.273s | 0.014s | 05/07/22 16:52:22 |
+----------+----------------+--------------+--------------------+
| Elapsed time: 0.29s |
+---------------------------------------------------------------+
| Sampled data post-processing: Done |
| Elapsed time: 0.01s |
+---------------------------------------------------------------+
| Summary |
+---------------------------------------------------------------+
| Peak host memory in use: 408MB |
| Peak device memory in use: 1300MB |
+---------------------------------------------------------------+
| Total execution time: 128.07s |
+---------------------------------------------------------------+
| End of computation |
+---------------------------------------------------------------+
Any idea how to make the C++/CUDA binary start up faster?