Motivation

In the rising market of portable devices, thinness and power consumption are two main considerations when producing a competitive and successful device, for example, a cell phone or a tablet. A negative side effect of these constraints is a smaller camera module, which in turn has lenses with small apertures and sensors with tiny pixels. Both characteristics limit the amount of light sensed by the camera, reducing the quality of photographs under adverse conditions, such as low-light environments or the short exposure times needed for fast-moving objects. This reduced quality appears in the image as increased motion blur, lack of contrast and increased noise, all non-idealities that limit the accuracy of scene recognition.

Automatic scene recognition based on the information present in an image is an important task with many practical applications in portable devices, for example, face detection on a smart phone. A corrupted image must therefore be restored to achieve better recognition results, which motivates the use of an image restoration algorithm to process the image before the recognition task. Hence, fast and effective image restoration is of major importance for performing recognition in real time.

Image restoration is a heavily researched subject, with many algorithms proposed in recent years. However, most of them are computationally demanding and take a long time to produce a restored image when implemented in software on a general-purpose CPU. This motivates the implementation of specialized hardware that can process a corrupted image in real time, producing the restored image as fast as possible and with enhanced quality.

In this work, a computationally inexpensive, low-power and real-time image denoising approach is presented. The denoising method enhances the quality of photographs taken by a camera and, consequently, the accuracy of scene recognition. A novel hardware architecture is presented that implements the image restoration algorithm on an FPGA.

Goals

Image denoising algorithms can be implemented on general-purpose CPUs, GPUs or specialized cores. The easiest solution is an implementation in a high-level language such as C or C++ on a CPU; however, it is also the slowest. A GPU is specialized for graphics and is therefore better suited as an image processor. On the other hand, most GPUs consume a lot of power when faced with the kind of tasks posed by most denoising algorithms. Therefore, the best solution is the development of a specialized core, able to denoise an image in real time and with low power consumption, without compromising the quality of the restored image.

With the reduction of CMOS fabrication costs, FPGA platforms have become a very appealing solution for the fast prototyping of novel hardware implementations. Hence, the main objective of this work is to develop a fully specialized core for image denoising on an FPGA. It is also an important goal that this implementation dramatically enhances the quality of images and, hence, the quality of classification in real time, with low power consumption.

System Description

Dabov et al. [1] proposed in 2007 a novel method for image denoising based on collaborative filtering in the transform domain. The algorithm, called block matching and 3D filtering (BM3D), comprises three major steps. First, sets of similar 2D image fragments (i.e. blocks) are grouped into 3D data arrays referred to as groups; this step is called block matching. Second, a 3D transform is applied to each group, resulting in a sparse representation that is filtered in the transform domain and, after inversion of the transform, produces the noise-free predicted blocks; this step is called collaborative filtering. Finally, the predicted noise-free blocks are returned to their original positions to form the recovered image. BM3D relies on the effectiveness of the block matching and collaborative filtering to produce good denoising results.
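The block matching step can be sketched in software as follows. This is a minimal illustration: the block size, search window, match limit and distance threshold are assumed typical values, not the parameters of the actual implementation.

```python
import numpy as np

def block_matching(img, ref_yx, block=8, search=16, max_blocks=16, tau=2500.0):
    """Group image blocks similar to the reference block at ref_yx.
    Returns the 3D group (stack of 2D blocks) and the block coordinates."""
    H, W = img.shape
    ry, rx = ref_yx
    ref = img[ry:ry + block, rx:rx + block]
    candidates = []
    # Scan a search window centered on the reference coordinate.
    for y in range(max(0, ry - search), min(H - block, ry + search) + 1):
        for x in range(max(0, rx - search), min(W - block, rx + search) + 1):
            cand = img[y:y + block, x:x + block]
            d = np.sum((ref - cand) ** 2) / block ** 2  # normalized distance
            if d <= tau:
                candidates.append((d, y, x))
    # Keep the closest matches and stack them into a 3D data array (group).
    candidates.sort(key=lambda t: t[0])
    kept = candidates[:max_blocks]
    group = np.stack([img[y:y + block, x:x + block] for _, y, x in kept])
    coords = [(y, x) for _, y, x in kept]
    return group, coords
```

The reference block always matches itself with distance zero, so it is the first block of the group.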

The BM3D algorithm comprises two "runs" of the aforementioned steps. First, the noisy image is processed using block matching, collaborative filtering and aggregation, with hard thresholding used for the shrinkage of the transform coefficients. This produces a basic estimate of the original noise-free image. Then, using this basic estimate as input, block matching is applied again, now more accurately because the noise is already significantly attenuated. The same groups formed in the basic estimate are formed in the original image. Collaborative filtering and aggregation are then applied, but Wiener filtering is used instead of hard thresholding for the shrinkage. The Wiener filter assumes that the energy spectrum of the basic estimate is the true energy spectrum of the image, and allows for more effective filtering than hard thresholding, improving the final image quality.
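The two shrinkage operators used in the two runs can be sketched as follows; the threshold multiplier `lam` is an assumed typical value, not necessarily the one used in this work.

```python
import numpy as np

def hard_threshold(coeffs, sigma, lam=2.7):
    """First run: zero out transform coefficients with magnitude below
    lam * sigma (lam = 2.7 is a commonly used value, assumed here)."""
    out = coeffs.copy()
    out[np.abs(out) < lam * sigma] = 0.0
    return out

def wiener_shrink(coeffs, basic_coeffs, sigma):
    """Second run: empirical Wiener shrinkage, with weights computed from
    the basic-estimate spectrum as described above."""
    w = basic_coeffs ** 2 / (basic_coeffs ** 2 + sigma ** 2)
    return w * coeffs
```

Where the basic estimate has little energy, the Wiener weight approaches zero and the coefficient is suppressed; where it dominates the noise, the weight approaches one and the coefficient is kept.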

In order to achieve the aforementioned goals, a specialized hardware architecture for image denoising using the BM3D algorithm is developed in this work. The system-level architecture is shown in the image below.


The main blocks of the architecture are the array of matching processors and the denoising path. Each matching processor is responsible for the block matching step of the currently processed coordinate, and by using N processors in parallel, multiple coordinates can be processed at the same time, effectively speeding up the algorithm. After matching is complete, the groups are filtered by the denoising path. This path starts with a FIFO that holds the coordinates of each grouped block, followed by the blocks responsible for applying the DCT and Haar transforms to the group. Next come the hard thresholding and Wiener filter blocks, which filter the image. The final step is performed by the inverse Haar and DCT blocks, which compute the filtered image patches from each group.
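For reference, the 1D Haar transform applied along the group (stacking) dimension of the denoising path can be sketched in floating point as follows; the hardware uses its own fixed-point arithmetic, so this is only an algorithmic illustration.

```python
import numpy as np

def haar_1d(v):
    """One level of the 1D Haar transform: averages (low-pass) followed
    by differences (high-pass) of adjacent samples, orthonormally scaled."""
    a = (v[0::2] + v[1::2]) / np.sqrt(2.0)  # approximation coefficients
    d = (v[0::2] - v[1::2]) / np.sqrt(2.0)  # detail coefficients
    return np.concatenate([a, d])

def ihaar_1d(c):
    """Inverse of one Haar level: reconstruct the interleaved samples."""
    n = len(c) // 2
    a, d = c[:n], c[n:]
    v = np.empty(2 * n)
    v[0::2] = (a + d) / np.sqrt(2.0)
    v[1::2] = (a - d) / np.sqrt(2.0)
    return v
```

A forward pass followed by the inverse reproduces the input exactly, which is the property the inverse Haar block of the pipeline relies on.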

Results

The BM3D system was tested on the Zynq-7000 ZC706 SoC evaluation board from Xilinx. This board contains the Xilinx XC7Z045 system on chip, which is divided into two sections: the Processing System (PS) and the Programmable Logic (PL).


As can be seen in the table above, the BM3D system needs a high number of hardware resources, due to the complexity of the operations performed. However, the highest utilization percentage is for LUTs, at around 41%, meaning that every resource of the ZYNQ PL is under 50% utilization. Nevertheless, the ZYNQ PL contains a large number of resources, which can be misleading when analyzing the occupation of the BM3D system. The high number of LUTs is due to the arithmetic operations performed in both the matching processors' L1 norm units and the denoising pipeline's 2D DCT and IDCT modules. The flip-flop utilization is mainly due to the pipeline stages used throughout the architecture, while the memory LUT usage is explained by the special memories included in the design that cannot be synthesized as BRAMs. The input and output BRAMs occupy 40.5% of the BRAMs in the PL, while only 1 out of 8 MMCMs is used.


The graphs above show PSNR and SSIM measurements for a set of seven 512x512 images. The results show that, for low noise powers, both measurements for the hardware implementation exhibit a relatively large decrease compared to the original BM3D algorithm, while remaining slightly above the noisy image measurements. However, as the noise power increases, the results of the hardware implementation approach those of the original BM3D, and both become substantially higher than those of the noisy image. The PSNR difference between the two implementations ranges from a maximum of 2.48 dB at a noise power of 5 to a minimum of 0.81 dB at a noise power of 50, corresponding to relative differences of 6.58% and 2.97%, respectively. Regarding the SSIM measurements, the difference ranges from a maximum of 0.046 at a noise power of 15 to a minimum of 0.011 at a noise power of 50, corresponding to relative differences of 5.46% and 1.61%, respectively.
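The PSNR measurement used above follows the standard definition; a minimal sketch for 8-bit images:

```python
import numpy as np

def psnr(ref, img, peak=255.0):
    """PSNR in dB between a reference and a test image (standard formula:
    10 * log10(peak^2 / MSE))."""
    mse = np.mean((ref.astype(np.float64) - img.astype(np.float64)) ** 2)
    if mse == 0.0:
        return float("inf")  # identical images
    return 10.0 * np.log10(peak ** 2 / mse)
```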


The tables above show PSNR (left) and SSIM (right) measurements for images of increasing resolution: 1MP - 1280x960, 2MP - 1920x1080, 3MP - 2048x1536, 5MP - 2560x1920 and 8MP - 3264x2448. Results show that for larger image resolutions the BM3D system continues to under-perform the original implementation's PSNR results by about 1 to 2 dB. However, for high noise powers, the SSIM values achieved by the BM3D system surpass those obtained by the software implementation.


The table above contains the execution times for images of increasing resolution with a constant noise power of 20, running on the ZYNQ board with the BM3D system operating at 50, 75, 100 and 125 MHz, as well as execution times on two CPUs: CPU1, an Intel i5-3317U processor running at 1.7 GHz, and CPU2, an Intel i5-4570 processor running at 3.2 GHz. Better results are expected from CPU2, since it is a desktop processor with more processing power and a higher frequency, while CPU1 is a low-power laptop processor. As these results show, the BM3D system is significantly faster than both CPUs tested. To better evaluate these results, the speedup value is computed, which is simply the ratio of the execution time on the CPUs to that on the ZYNQ board. This value was computed for all frequencies of operation and against both CPUs, and the results are presented in the graphs below.
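The speedup metric can be sketched as follows; the timing values in the example are hypothetical, for illustration only, not the measured results.

```python
def speedup(t_cpu, t_fpga):
    """Speedup as the ratio of CPU execution time to FPGA execution time."""
    return t_cpu / t_fpga

# Hypothetical execution times in seconds (illustration only).
t_cpu1 = {"1MP": 9.0, "8MP": 70.0}
t_fpga_50mhz = {"1MP": 0.42, "8MP": 3.1}
speedups = {res: speedup(t_cpu1[res], t_fpga_50mhz[res]) for res in t_cpu1}
```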


As can be seen in the graphs, the profile of the speedup results is quite similar, with an almost monotonic increase with the resolution of the image. For the BM3D system running at 50 MHz (blue) and compared with CPU1, the speedup values range from 20.6 to 23.0, averaging 21.4. For increasing frequencies, the values are even higher: averages of 31.4 at 75 MHz (grey), 41.4 at 100 MHz (orange) and 51.1 at 125 MHz (yellow). When comparing with CPU2, the values are smaller, due to the higher performance of this processor. Even so, the speedup ranges from 12.3 to 14.3 at the slowest frequency of 50 MHz, up to 29.3 to 34.4 at the highest frequency (125 MHz). These results are very satisfactory, since the ZYNQ board runs at a much lower frequency than the CPUs. It is also worth noting that the speedup increases with image resolution, meaning that the BM3D system handles larger images more easily than the CPUs.


The figure above shows the power consumption distribution of the BM3D system for the different synthesized frequencies. For all frequencies, power consumption is between 2 and 3 Watts, which is low considering the amount of FPGA resources used and the complexity of the BM3D algorithm. Furthermore, a good distribution of dynamic versus static power is achieved, with 89 to 91% dynamic power and 9 to 11% static power. The static power increase with frequency is negligible, amounting to only 4 mW between 50 and 125 MHz. On the other hand, the dynamic power increases steadily with frequency, which accounts for the total increase in power consumption. This increase is perceptible in the Clocks, Signals and Logic categories of the dynamic power, whilst the BRAM and MMCM power usage stays constant. This is because the increased frequency is applied to the BM3D IP core, and not to the remainder of the design (AXI Lite, BRAM, PS), meaning that the signals and logic switch more often and therefore consume more power.


The table above shows the separation of the power consumption of the PS and the PL of the ZYNQ board for the different synthesized frequencies.

Conclusions

In this work a novel hardware implementation for the BM3D image denoising algorithm was presented. Taking advantage of an FPGA, the BM3D IP core accelerates the process of image denoising, allowing for excellent denoising performance, whilst having a low power consumption. Considering its complexity, this work was developed in multiple phases.

The first phase was an extensive study of the background and state of the art in image denoising, in order to understand the BM3D denoising algorithm and compare it with others. The second phase was implementing the BM3D algorithm in MATLAB, in order to gain experience with image denoising and to fully analyze the bottlenecks of the algorithm. The third phase consisted of adjusting or modifying these bottlenecks in order to optimize the corresponding steps towards an efficient hardware implementation. This was followed by the most extensive phase of the work, the actual development of the hardware. It consisted of the sequential development of various hardware blocks, followed by their simulation, validation and testing. The blocks were then connected to create the BM3D IP, which led to further simulations and some design tweaks in order to achieve a fully working system. Finally, the last phase was the final testing of the system with noisy images, and the evaluation of several performance parameters: denoising performance, by calculating the PSNR and SSIM measurements; run time, by comparing the hardware and software implementations of the algorithm; and power consumption of the hardware implementation.

Experimental results show that the BM3D IP can effectively denoise images of various resolutions and over a wide range of noise powers. Furthermore, although denoising performance is somewhat lower than that of the software implementation of the BM3D algorithm, the gains in execution time and power consumption are significant enough to make the BM3D IP a viable alternative to software image denoising.

Future Work

Despite the fact that the results achieved by the BM3D system are quite satisfactory, there are some improvements worth mentioning as possibilities for future work.

The first improvement is the hardware implementation of the Wiener filter, in order to further improve the denoising results, especially in terms of SSIM. The design of this new implementation was already analyzed and involves the development of the Wiener filter block, which includes implementing a division operation. This block would be placed in parallel with the hard thresholding block in the denoising pipeline currently implemented in the BM3D system, and the data would be multiplexed in order to choose which denoising technique to use. The second improvement would be increasing the parallelization of the data in the denoising pipeline, so that more levels of the Haar wavelet decomposition could be applied. This improvement would increase denoising performance while also decreasing execution time. On the negative side, implementation area would increase dramatically.

The third improvement regards an extension of the system to color images. As proposed in the original paper, the BM3D algorithm can also be applied to color images, for example by converting from the RGB colorspace to YCbCr, in which the Y channel is the luminance and Cb and Cr are the chrominance channels. With this separation, BM3D is applied by performing block matching on the Y channel only, and reusing the groups formed across all three channels in order to perform collaborative filtering separately on each channel. Regarding a hardware implementation, the same structure as the current BM3D system could be used, with an additional modification allowing the block matching step to be skipped for the Cb and Cr channels. This way, on a first run, the Y channel would be processed exactly as a grayscale image, and all the group positions would be kept. Then, the Cb and Cr channels would be processed consecutively, using the groups previously formed and skipping the block matching step. However, this processing would roughly triple the execution time, since the image data is also three times the size of a grayscale image.
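The RGB to YCbCr conversion mentioned above can be sketched with the standard ITU-R BT.601 coefficients; treating this as the target colorspace is an assumption, since other color transforms could equally be used.

```python
import numpy as np

def rgb_to_ycbcr(rgb):
    """Full-range RGB -> YCbCr conversion using the BT.601 luma weights.
    Y carries the luminance; Cb and Cr carry the chrominance, offset by 128."""
    r, g, b = rgb[..., 0], rgb[..., 1], rgb[..., 2]
    y = 0.299 * r + 0.587 * g + 0.114 * b
    cb = -0.168736 * r - 0.331264 * g + 0.5 * b + 128.0
    cr = 0.5 * r - 0.418688 * g - 0.081312 * b + 128.0
    return np.stack([y, cb, cr], axis=-1)
```

For a gray pixel (R = G = B) the chrominance channels come out at the neutral value 128, which is why block matching on Y alone captures the structural content of the image.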