The BM3D system was tested on the Xilinx ZYNQ-7000 ZC706 SoC evaluation board. This board contains the Xilinx XC7Z045 system on chip, which is divided into two sections: the Processing System (PS) and the Programmable Logic (PL).
As the table above shows, the BM3D system requires a large amount of hardware resources because of the complexity of the operations performed. Even so, the highest utilization is for LUTs, at around 41%, so every resource type in the ZYNQ PL stays under 50% utilization. These percentages should be read with care: the ZYNQ PL contains a large number of resources, so a moderate percentage still corresponds to a large absolute resource count. The high number of LUTs is due to the arithmetic operations performed in both the matching processor's L1 norm units and the denoising pipeline's 2D DCT and IDCT modules. The flip-flop utilization is mainly due to the pipeline stages used throughout the architecture, while the memory LUT usage is explained by the special memories in the design that cannot be synthesized in BRAMs. The input and output BRAMs occupy 40.5 of the BRAMs in the PL, while only 1 out of 8 MMCMs is used.
The graphs above show PSNR and SSIM measurements for a set of seven 512x512 images. For low noise powers, both metrics for the hardware implementation are noticeably lower than those of the original BM3D algorithm and only slightly higher than those of the noisy image. As the noise power increases, however, the hardware results approach those of the original BM3D, and both are far higher than those of the noisy image. The PSNR difference between the two implementations ranges from a maximum of 2.48 dB at a noise power of 5 to a minimum of 0.81 dB at a noise power of 50, which correspond to relative differences of 6.58% and 2.97%, respectively. For SSIM, the difference ranges from a maximum of 0.046 at a noise power of 15 to a minimum of 0.011 at a noise power of 50, which correspond to relative differences of 5.46% and 1.61%, respectively.
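The PSNR metric and the relative differences quoted above can be illustrated with a minimal sketch. It assumes 8-bit grayscale images stored as lists of rows and a peak value of 255. The helper names, the tiny sample images and the choice of the original BM3D result as the baseline for the relative difference are assumptions made for illustration, not details taken from the evaluation itself.

```python
import math


def mse(reference, test):
    """Mean squared error between two equally sized grayscale images."""
    total = 0.0
    count = 0
    for ref_row, test_row in zip(reference, test):
        for ref_px, test_px in zip(ref_row, test_row):
            total += (ref_px - test_px) ** 2
            count += 1
    return total / count


def psnr(reference, test, peak=255.0):
    """Peak signal-to-noise ratio in dB (peak=255 assumes 8-bit pixels)."""
    err = mse(reference, test)
    if err == 0:
        return math.inf  # identical images
    return 10.0 * math.log10(peak ** 2 / err)


def relative_difference(baseline, value):
    """Percentage by which `value` falls short of `baseline` (assumed baseline)."""
    return 100.0 * (baseline - value) / baseline


# Hypothetical 2x2 images: every pixel of `shifted` is off by 5 grey levels.
clean = [[100, 120], [140, 160]]
shifted = [[105, 125], [145, 165]]
psnr_value = psnr(clean, shifted)
```

The same `psnr` function would be applied three times per test image (noisy, hardware output, software output), each against the clean original, to obtain the three curves being compared.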
The tables above show PSNR (left) and SSIM (right) measurements for images of increasing resolutions: 1MP - 1280x960, 2MP - 1920x1080, 3MP - 2048x1536, 5MP - 2560x1920 and 8MP - 3264x2448. The results show that for larger image resolutions the BM3D system still falls short of the original implementation's PSNR by about 1 to 2 dB. For high noise powers, however, the SSIM values achieved by the BM3D system surpass those obtained by the software implementation.
The table above contains the execution times for images of increasing resolutions with a constant noise power of 20, running on the ZYNQ board with the BM3D system operating at 50, 75, 100 and 125 MHz, as well as execution times on two CPUs: CPU1, an Intel i5-3317U processor running at 1.7 GHz, and CPU2, an Intel i5-4570 processor running at 3.2 GHz. Better results are expected from CPU2, since it is a desktop processor with more processing power and a higher clock frequency, while CPU1 is a low-power laptop processor. These results show that the BM3D system is significantly faster than both CPUs tested. To better evaluate them, the speedup is computed as the ratio of the execution time on the CPU to the execution time on the ZYNQ board. This value was computed for all operating frequencies against both CPUs, and the results are presented in the graphs below.
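As a minimal sketch of the speedup definition used here (CPU execution time divided by ZYNQ execution time), the computation can be written as follows. The timing values are hypothetical placeholders, not measurements from the table, and the function and variable names are chosen for illustration only.

```python
def speedup(cpu_time_s, fpga_time_s):
    """Speedup of the FPGA system over a CPU: CPU time divided by FPGA time."""
    if fpga_time_s <= 0:
        raise ValueError("FPGA execution time must be positive")
    return cpu_time_s / fpga_time_s


# Hypothetical execution times in seconds, keyed by image resolution.
cpu_times = {"1MP": 40.0, "2MP": 70.0}
fpga_times = {"1MP": 2.0, "2MP": 3.5}

# One speedup value per resolution, as plotted in a speedup graph.
speedups = {res: speedup(cpu_times[res], fpga_times[res]) for res in cpu_times}
```

A speedup above 1 means the ZYNQ board finishes first. Repeating the calculation for each operating frequency and each CPU gives one curve per combination.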
As the graphs show, the speedup curves have similar profiles, with an almost monotonic increase with the resolution of the image. For the BM3D system running at 50 MHz (blue) and compared with CPU1, the speedup values range from 20.6 to 23.0, averaging 21.4. At higher frequencies the values are higher still: an average of 31.4 for 75 MHz (grey), 41.4 for 100 MHz (orange) and 51.1 for 125 MHz (yellow). Compared with CPU2, the values are smaller because of the higher performance of this processor. Even so, the speedup ranges from 12.3 to 14.3 at the lowest frequency of 50 MHz, and from 29.3 to 34.4 at the highest frequency of 125 MHz. These results are very satisfactory, since the ZYNQ board runs at a much lower frequency than the CPUs. It is also worth noting that the speedup increases with image resolution, meaning that the advantage of the BM3D system over the CPUs grows for larger images.
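One way to read the average CPU1 speedups quoted above is to check how closely they follow the clock frequency. The sketch below uses only those four averages. The "scaling efficiency" measure (speedup gain divided by frequency gain, relative to 50 MHz) is introduced here for illustration and is not a metric from the original evaluation.

```python
# Average speedups over CPU1 quoted in the text, keyed by clock frequency in MHz.
avg_speedup_cpu1 = {50: 21.4, 75: 31.4, 100: 41.4, 125: 51.1}


def scaling_efficiency(speedups, base_freq):
    """Ratio of speedup gain to frequency gain, relative to `base_freq`.

    A value of 1.0 means the speedup grows exactly in proportion to the clock.
    """
    base = speedups[base_freq]
    return {
        freq: (value / base) / (freq / base_freq)
        for freq, value in speedups.items()
    }


efficiency = scaling_efficiency(avg_speedup_cpu1, 50)
```

Under this measure the averages stay within a few percent of proportional scaling and fall slightly as frequency rises. One possible explanation is that the parts of the design not clocked at the BM3D core frequency take a growing share of the run time.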
The figure above shows the power consumption distribution of the BM3D system for the different synthesized frequencies. For all frequencies, power consumption is between 2 and 3 W, which is low considering the amount of FPGA resources used and the complexity of the BM3D algorithm. Furthermore, a good balance of dynamic versus static power is achieved, with dynamic power accounting for 89 to 91% of the total and static power for the remaining 11 to 9%. The increase in static power with frequency is negligible, at only 4 mW between 50 and 125 MHz. Dynamic power, on the other hand, grows with frequency and accounts for essentially all of the increase in total power consumption. This increase is visible in the Clocks, Signals and Logic categories of the dynamic power, whilst the power usage of the BRAM and MMCM modules stays constant. The reason is that the increased frequency is applied to the BM3D IP core and not to the remainder of the design (AXI Lite, BRAM, PS), so the signals and logic of the core switch more often and therefore consume more power.
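The dynamic and static shares discussed above follow directly from the two power figures reported for each frequency. A minimal sketch of that calculation is given below. The wattage values are hypothetical placeholders rather than readings from the figure, and the function name is an assumption.

```python
def power_shares(dynamic_w, static_w):
    """Return (dynamic %, static %) of total power from the two components."""
    total_w = dynamic_w + static_w
    if total_w <= 0:
        raise ValueError("total power must be positive")
    return 100.0 * dynamic_w / total_w, 100.0 * static_w / total_w


# Hypothetical example: 2.25 W dynamic and 0.25 W static power.
dyn_pct, stat_pct = power_shares(2.25, 0.25)
```

Because static power barely changes with frequency while dynamic power grows, the dynamic share computed this way rises slightly as the clock frequency increases.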
The table above shows the separation of the power consumption of the PS and the PL of the ZYNQ board for the different synthesized frequencies.